ABSTRACT



Context. Photometric redshifts (photo-z's) have become an essential tool in extragalactic astronomy. Many current and upcoming 
observing programmes require great accuracy of photo-z's to reach their scientific goals. 

Aims. Here we introduce PHAT, the PHoto-z Accuracy Testing programme, an international initiative to test and compare different 
methods of photo-z estimation. 

Methods. Two different test environments are set up, one (PHAT0) based on simulations to test the basic functionality of the different 
photo-z codes, and another one (PHAT1) based on data from the GOODS survey including 18-band photometry and ~ 2000 spectro- 
scopic redshifts. 

Results. The accuracy of the different methods is expressed and ranked by the global photo-z bias, scatter, and outlier rates. While
most methods agree very well on PHAT0, there are differences in the handling of the Lyman-α forest for higher redshifts. Furthermore,
different methods produce photo-z scatters that can differ by up to a factor of two even in this idealised case. A larger spread in ac- 
curacy is found for PHAT1. Few methods benefit from the addition of mid-IR photometry. The accuracy of the other methods is 
unaffected or suffers when IRAC data are included. Remaining biases and systematic effects can be explained by shortcomings in the 
different template sets (especially in the mid-IR) and the use of priors on the one hand and an insufficient training set on the other 
hand. Some strategies to overcome these problems are identified by comparing the methods in detail. Scatters of 4-8% in Δz/(1 + z)
were obtained, consistent with other studies. However, somewhat larger outlier rates (> 7.5% with Δz/(1 + z) > 0.15; > 4.5% after
cleaning) are found for all codes that can only partly be explained by AGN or issues in the photometry or the spec-z catalogue. Some
outliers were probably missed in comparisons of photo-z's to other, less complete spectroscopic surveys in the past. There is a general 
trend that empirical codes produce smaller biases than template-based codes. 

Conclusions. The systematic, quantitative comparison of different photo-z codes presented here is a snapshot of the current state-of- 
the-art of photo-z estimation and sets a standard for the assessment of photo-z accuracy in the future. The rather large outlier rates 
reported here for PHAT1 on real data should be investigated further since they are most probably also present (and possibly hidden) in 
many other studies. The test data sets are publicly available and can be used to compare new, upcoming methods to established ones 
and help in guiding future photo-z method development. 



1. Introduction 

The estimation of redshifts from photometry alone is an old idea
([...]ly et al. 1995b). It has come a long way from being a
rarely used technique for special kinds of objects to a major tool
now widely used for a multitude of observational programmes.

Not only can this photometric redshift (photo-z) approach
yield redshifts of fainter objects than are accessible by spectroscopy,
but the efficiency in terms of the number of objects with redshift
estimates per unit telescope time is also greatly increased. These
two properties make photo-z's extremely attractive for observing
programmes that depend on redshifts for a large number of faint
galaxies, provided these redshifts do not have to be as precise as
spectroscopic redshifts (spec-z's).

Still, the requirements on the accuracy of photo-z's for upcoming
surveys are formidable. Photo-z's are essential in constraining
dark energy (DE) by weak gravitational lensing and
can also be used for other DE probes such as galaxy clustering,
supernovae of type Ia, and the mass function of galaxy clusters
(Albrecht et al. 2006; Peacock et al. 2006). Surveys of galaxy
formation and evolution also depend on photo-z's to study these
processes as a function of environment and to probe to fainter
levels than with spectroscopy alone. To fully exploit the power
of these huge, future data sets, photo-z's with a very low level of
residual systematics are needed (e.g. Huterer et al. 2006).

There are many aspects that influence the performance of
photo-z's. The choice of an observing strategy sets the theoretical
limit for the accuracy. Choosing the filters and distributing
the available observing time over the different filters to reach
certain depths can have a great impact on photo-z's. Accurate
photometric calibration is of great importance, as is the removal
of the effects of the differing point-spread functions (PSFs) in
the different bands. Varying column densities of Galactic dust
over the survey area have to be accounted for before a photo-z
code can be expected to perform at its best.

Here we would like to ignore all these effects as much as
possible and concentrate on the last link in the chain, the photo-z
methods themselves. It is clear that the two regimes - data and
method - cannot be separated cleanly because there are connections
between the two. For example, it is highly likely that one
method of photo-z estimation will perform better than a second
method on one particular data set while the situation may well be
reversed on a different data set. Whenever such a situation arises
in the following, we will try to alert the reader to it.

The methodology behind photo-z's is developing fast, with
ever more complex methods yielding results of increasing accuracy.
In this context it is important to set a standard for comparing
the different methods to each other in order to make quantitative
statements about their differences and to take a snapshot of
today's state of the art. Such comparisons and rankings can then
be used to identify the most promising approaches and to
concentrate on their further improvement.

In this paper we present an international initiative named
PHAT (PHoto-z Accuracy Testing), which was initiated to carry
out such a quantitative comparison. A very similar initiative has
been carried out for shape measurement algorithms in the Shear
TEsting Program (STEP; Heymans et al. 2006; Massey et al.
2007) and led to important improvements in the methodology
of measuring galaxy shapes for weak gravitational lensing
applications. Similar but much more limited blind tests of photo-z's
have been performed by Hogg et al. (1998) on spectroscopic
data from the Keck telescope on the Hubble Deep Field (HDF),
by Hildebrandt et al. (2008) on spectroscopic data from the
VIMOS VLT Deep Survey (VVDS; Le Fèvre et al.) and
the FORS Deep Field (FDF; Noll et al. 2004), and by Abdalla
et al. (2008b) on the sample of Luminous Red Galaxies from the
SDSS-DR6.

In the framework of PHAT we provide standardised test environments
to the photo-z community, which consist of simulated
or observed photometric catalogues together with additional
material like filter curves, SED templates, and training
sets. These data sets can be used in a blind (or semi-blind, i.e.
with support of a training set) test by the participants to estimate
redshifts with their favourite codes. Two such test steps
have been carried out so far. The first one, called PHAT0, is based
on a highly idealised simulation representing an easy case to test
the most basic elements of photo-z estimation and to identify
possible low-level discrepancies between the methods. The second
test, called PHAT1, is based on real data originating from the
Great Observatories Origins Deep Survey (GOODS; Giavalisco
et al. 2004), representing a much more complex environment
that pushes photo-z codes to their limits and reveals more
systematic difficulties.

PHAT was conceived as an open competition. The test data
sets are publicly available on the PHAT website and all major
photo-z groups in the astronomical community were informed
of the initiative via email. Furthermore, PHAT was advertised
at several meetings and workshops to increase its visibility. The
photo-z codes presented here were not selected by the PHAT
coordinators but reflect the interest of the community in such a
competition. This strategy led to an impressive response of 21
participants submitting results obtained with 17 different photo-z
codes. After a large number of results had been collected for each
test data set, the results of all codes were published on the PHAT
website. The test data sets are still kept blind (i.e. the individual
redshifts are withheld) to allow further participants to meet
the same conditions.

First we briefly summarise every photo-z method that was
used within PHAT (Sect. 2). Then in Sects. 3 and 4 the motivation
behind the tests, the data sets, and the results are described in
detail for PHAT0 and PHAT1, respectively. In Sect. 5 we conclude
and give an outlook on future activities within PHAT. We use AB
magnitudes throughout.



2. Methods 

In the following we describe the different methods that were
used to estimate photo-z's from the catalogues presented in
Sects. 3 and 4. A summary of the methods can also be found in
Table 1 together with the three-letter acronyms that are used in
the remainder of the paper to identify the codes. The third, lowercase
letter indicates whether the code belongs to the empirical codes
(-e), which are trained on the colours of a sub-sample of objects
with accurate redshift estimates (e.g. spec-z's), or to the
codes fitting SED templates to the observed photometry (-t). It
should be noted that this distinction is somewhat fuzzy. A number
of codes include ingredients from both regimes. We just keep
this terminology because it has been widely used in the literature.
For a more rigorous description of the underlying concepts
in photo-z methods and their common properties see Budavári
(2009).

2 Most empirical codes offer the flexibility of also using any other
photometric observable, e.g. size, concentration, or surface brightness.
Since we only use magnitudes in PHAT we skip this detail in the
remainder of Sect. 2.

Note that the descriptions of the different template sets of
the template SED fitting codes in the following subsections only
apply to PHAT1. For PHAT0 the template set was provided and
it was used by every participant with a template-based code.

2.1. BPZ (BP-t)

BPZ (Bayesian Photo-z's; Benítez 2000) introduced the use of
Bayesian inference and priors to photometric redshift estimation.
The code uses a prior P(z, T | m_0), which gives the probability
that a galaxy of apparent magnitude m_0 has redshift z and
SED type T. As an example of how the prior
works, bright objects and ellipticals are assumed unlikely to be at
high redshift. For each galaxy, this information is combined (in
a Bayesian manner) with the likelihood P(C | z, T) of observing
the galaxy colours C for each redshift and SED pair, yielding the
final P(z, T | C, m_0). By marginalising over T, P(z) is obtained
along with the most likely redshift z_b and its uncertainties. For
the PHAT tests, BPZ version 1.99.3 is used, a slightly updated
version of that used in the Coe et al. (2006) UDF analysis.
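The Bayesian combination described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the BPZ implementation: the function name `posterior_pz`, the coarse redshift grid, and all input numbers are hypothetical.

```python
def posterior_pz(likelihood, prior):
    """Combine P(C | z, T) with P(z, T | m_0) and marginalise over T.

    likelihood[t][i] and prior[t][i] hold the values for SED type t at
    grid redshift i. Returns the normalised P(z) on the redshift grid.
    """
    n_z = len(likelihood[0])
    # The posterior P(z, T | C, m_0) is proportional to likelihood times prior.
    joint = [[lik * pri for lik, pri in zip(lik_row, pri_row)]
             for lik_row, pri_row in zip(likelihood, prior)]
    # Marginalise over the SED type T.
    p_z = [sum(row[i] for row in joint) for i in range(n_z)]
    norm = sum(p_z)
    return [p / norm for p in p_z]


z_grid = [0.5, 1.0, 3.0]
# Hypothetical colour likelihoods for two SED types: the colours alone
# cannot decide between z = 0.5 and z = 3.0.
likelihood = [[0.40, 0.05, 0.45],
              [0.30, 0.10, 0.35]]
# Hypothetical magnitude prior for a bright object: high redshift is unlikely.
prior = [[0.50, 0.30, 0.01],
         [0.40, 0.25, 0.01]]

p_z = posterior_pz(likelihood, prior)
z_b = z_grid[p_z.index(max(p_z))]  # most likely redshift on the grid
```

In this made-up case the likelihood alone slightly prefers z = 3.0, while the prior shifts the peak of P(z) to z = 0.5, which is the behaviour the text describes for bright objects.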



- Templates: The Coe et al. (2006) SED templates are used
with BPZ, which include a CWW+SB SED template set
(similar to that used in PHAT0 with Kinney et al.'s SB1
replaced by SB3) as introduced in Benítez (2000) and
recalibrated by Benítez et al. (2004) plus two younger starburst
templates from Bruzual & Charlot (2003) added in Coe et al.
(2006). Note that the empirical CWW+SB templates as well
as the synthetic BC03 templates include emission lines. No
dust extinction was added to the BC03 templates. Between
each of the eight adjacent templates two interpolated templates
are added, for a total of 22 templates. Beyond 25 600 Å,
the majority of the templates are undefined and must be
extrapolated. Thus it cannot be expected that these templates
provide good fits to IRAC photometry of low-redshift objects.

- Prior: For PHAT0, a flat prior is used. The prior was calculated
by Benítez (2000) based on objects with spec-z in the
CFRS (Lilly et al. 1995) and HDF-N (Williams et al. 1996).
It was shown to yield results superior to the "flat" prior
implicitly assumed by maximum likelihood (or "frequentist")
methods.

- Training: No training with the model-z's/spec-z's was per- 
formed. 
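The template count quoted in the Templates item above (eight base SEDs, two interpolated between each adjacent pair, 22 in total) can be checked with a short sketch. Linear interpolation of the fluxes is an assumption made here for illustration; the text does not say how BPZ interpolates, and the function name and toy SEDs are hypothetical.

```python
def interpolate_templates(templates, n_between=2):
    """Insert n_between linearly interpolated SEDs between adjacent templates."""
    result = []
    for sed_a, sed_b in zip(templates, templates[1:]):
        result.append(sed_a)
        for k in range(1, n_between + 1):
            w = k / (n_between + 1)  # interpolation weight towards sed_b
            result.append([(1 - w) * fa + w * fb for fa, fb in zip(sed_a, sed_b)])
    result.append(templates[-1])
    return result


# Eight hypothetical SEDs, each sampled at three wavelengths.
base = [[float(i), float(2 * i), 1.0] for i in range(8)]
full = interpolate_templates(base)
```

Seven adjacent pairs with two interpolated SEDs each give 14 extra templates, so `len(full)` is 8 + 14 = 22.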

2.2. BPZ (BP2-t) 

BPZ is run on PHAT1 a second time with a different template set 
and additional training. 

- Templates: The second library (Benítez 2010, in preparation)
uses as its starting point a set of 6 templates from PEGASE
(Fioc & Rocca-Volmerange 1997) selected to be similar to
the Coe et al. (2006) templates. This library is further
calibrated using the FIREWORKS photometry and spectroscopic
redshifts from Wuyts et al. (2008). Note that these
templates include emission lines and dust extinction.

- Prior: Same as BP-t.

- Training: The templates are compared to the photometry of
the spec-z training set and new zero points are estimated, as
in Coe et al. (2006). We also measure the amount of excess
scatter in the predicted vs measured colours compared with
that expected from the catalogue photometric errors and typical
template uncertainties (Brammer et al. 2008). This excess
scatter is included in the photo-z estimation as a zero point
uncertainty.
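A minimal sketch of the two training steps named above, zero-point estimation and excess scatter, is given below. It assumes that model magnitudes from the best-fitting template at the known spec-z are already available, uses a per-band median as the zero-point estimator, and ignores the template uncertainty term; these choices and all names are hypothetical and are not taken from BPZ.

```python
from statistics import median


def zero_point_offsets(observed_mags, model_mags):
    """Median of (observed - model) magnitude per band over the training set.

    observed_mags[i][b] and model_mags[i][b] refer to object i in band b.
    """
    n_bands = len(observed_mags[0])
    return [median(obs[b] - mod[b] for obs, mod in zip(observed_mags, model_mags))
            for b in range(n_bands)]


def excess_scatter(residuals, phot_errors):
    """Scatter in the residuals beyond what the catalogue errors predict."""
    n = len(residuals)
    mean = sum(residuals) / n
    observed_var = sum((r - mean) ** 2 for r in residuals) / n
    expected_var = sum(e ** 2 for e in phot_errors) / n
    # Clip at zero when the catalogue errors already explain the scatter.
    return max(observed_var - expected_var, 0.0) ** 0.5


# Hypothetical training set: three objects, two bands.
observed = [[22.10, 21.05], [23.12, 22.00], [21.08, 20.10]]
model = [[22.00, 21.00], [23.00, 22.00], [21.00, 20.00]]
offsets = zero_point_offsets(observed, model)
extra = excess_scatter([0.1, -0.1, 0.1, -0.1], [0.06] * 4)
```

The value returned by `excess_scatter` would then be added in quadrature to the photometric errors of that band, which is one way to read "included as a zero point uncertainty".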

2.3. EAZY (EA-t)

EAZY (Brammer et al. 2008) is a template-fitting code designed
to produce unbiased photometric redshift estimates for deep
multi-wavelength surveys that lack representative calibration
samples with spectroscopic redshifts.

- Templates: EAZY uses a unique template set derived using
the non-negative matrix factorisation algorithm (Sha et al.
2007; Blanton & Roweis 2007) trained on synthetic photometry
from the semi-analytic light-cone produced by De Lucia
& Blaizot (2007). These templates can be considered the
principal component spectra of all galaxies at 0 < z < 4 in
the light-cone, allowing for subtle differences between local
and high-redshift galaxy samples. EAZY is able to reproduce
complex star-formation histories by fitting non-negative linear
combinations of the templates. The templates include
emission lines following the prescription of Ilbert et al.
(2009).

- Template error function: Template mismatch is addressed 
with a "template error function", which assigns lower 
weights at rest-frame wavelengths where the template cali- 
bration is uncertain or where the templates are not expected 
to fully reproduce observed galaxy colours. This feature is 
particularly important when using mid-IR (IRAC) photom- 
etry, which samples wavelengths where the observed emis- 
sion can be dominated by non-stellar (i.e. dust) sources not 
included in the templates. 

- Prior: EAZY adopts a prior equal to the normalised redshift
distribution of galaxies in the De Lucia & Blaizot (2007)
semi-analytic light-cone at a given apparent R or K magnitude.
This is akin to a luminosity prior under the assumption
that the light-cone reasonably reproduces the galaxy luminosity
function.

- Training: No training with the model-z's/spec-z's was per- 
formed. 
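The idea of the template error function listed above can be illustrated as follows. The sketch assumes that the fractional template uncertainty is added in quadrature to the photometric error, so that poorly calibrated rest-frame wavelengths receive lower weight in χ². The step-shaped `template_error` and its numbers are invented for illustration and do not reproduce the function shipped with EAZY.

```python
def template_error(rest_wavelength_um):
    """Hypothetical fractional template uncertainty versus rest-frame wavelength."""
    if rest_wavelength_um < 0.2 or rest_wavelength_um > 2.0:
        return 0.20  # assumed large uncertainty in the far-UV and mid-IR
    return 0.05      # assumed small uncertainty in the optical and near-IR


def chi2_with_template_error(obs_flux, obs_err, model_flux, obs_wavelength_um, z):
    """chi^2 with the template uncertainty added in quadrature per band."""
    chi2 = 0.0
    for flux, err, model, lam in zip(obs_flux, obs_err, model_flux, obs_wavelength_um):
        lam_rest = lam / (1.0 + z)
        sigma2 = err ** 2 + (model * template_error(lam_rest)) ** 2
        chi2 += (flux - model) ** 2 / sigma2
    return chi2


# One band at 8.0 micron observed wavelength, same data at two redshifts.
low_z = chi2_with_template_error([1.2], [0.1], [1.0], [8.0], z=0.0)
high_z = chi2_with_template_error([1.2], [0.1], [1.0], [8.0], z=3.0)
```

At z = 0 the 8.0 micron band samples the rest-frame mid-IR and is down-weighted, so the same residual contributes less to χ² than at z = 3, where the band samples rest-frame 2.0 micron.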

2.4. GALEV and GAZELLE (GA-t) 



GAZELLE (Kotulla & Fritze 2009; Kotulla, in preparation) is
based on a χ² minimisation algorithm to compare the observed
SEDs to a large library of GALEV evolutionary synthesis models
(Kotulla et al. 2009). GAZELLE also accounts for inherent
uncertainties in the model grid, e.g. due to uncertainties in the stellar
evolution data and stellar spectral libraries, by assuming a 0.1
mag uncertainty in all filters.
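A χ² comparison in magnitudes with a 0.1 mag model uncertainty can be sketched as below. Adding the model uncertainty in quadrature to the photometric error and solving analytically for a free additive offset (the overall normalisation) are assumptions of this sketch; the function names and the toy model grid are hypothetical and not part of GAZELLE.

```python
def chi2_mag(obs_mag, obs_err, model_mag, model_err=0.1):
    """chi^2 between observed and model magnitudes with a model error term."""
    weights = [1.0 / (err ** 2 + model_err ** 2) for err in obs_err]
    diffs = [obs - mod for obs, mod in zip(obs_mag, model_mag)]
    # The best additive offset is the weighted mean magnitude difference.
    offset = sum(w * d for w, d in zip(weights, diffs)) / sum(weights)
    return sum(w * (d - offset) ** 2 for w, d in zip(weights, diffs))


def best_model(obs_mag, obs_err, model_grid):
    """Return (chi2, label) for the best entry of a {label: magnitudes} grid."""
    return min((chi2_mag(obs_mag, obs_err, mags), label)
               for label, mags in model_grid.items())


obs = [24.0, 23.5, 23.0]
err = [0.05, 0.05, 0.05]
grid = {"Sa z=0.5": [21.0, 20.5, 20.0],
        "E z=1.0": [21.0, 20.0, 19.0]}
best = best_model(obs, err, grid)
```

With photometric errors of 0.05 mag the effective per-band uncertainty is dominated by the 0.1 mag model term, which keeps a few very precise bands from driving the fit.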

- Templates: GALEV includes a full suite of emission lines
(Anders & Fritze 2003), a detailed treatment of the attenuation
due to intergalactic HI (Madau 1995), and optionally a
chemical evolution model. This combination makes it possible
to estimate not only photometric redshifts but at the same time
physical parameters (stellar masses, star formation rates,
etc.) for each galaxy in a consistent manner. Masses and
mass-dependent parameters are computed by scaling model
values with the scaling factor derived from matching the
overall normalisation of the template fluxes relative to the
observed fluxes. For the PHAT1 run the model grid included
5 undisturbed models for E and Sa-Sd type galaxies supplemented
with a set of 21 models encountering a strong starburst
at galaxy ages of 0.5 to 10 Gyr, followed by subsequent
post-starburst phases. All models assume star formation
to begin at z = 8; for the undisturbed models a chemically
consistent evolution (see Kotulla & Fritze 2009 for details)
is chosen, for the burst models a metallicity fixed to
half the solar value is used. All templates include the full
evolution from the onset of star formation until the present
day and the Calzetti et al. (2000) dust extinction description
is chosen. Emission lines are included as well.

- Filter weighting: To avoid complications at wavelengths be- 
yond the rest-frame K-band where dust emission becomes 
increasingly important, only filters that cover the rest-frame 
K-band or shorter wavelengths are included, effectively ig- 
noring some of the Spitzer filters at low-redshift. 

- Prior: No prior is included that might affect the resulting 
redshift distribution. 

- Training: No training with the model-z's/spec-z's was per- 
formed. 
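The filter weighting rule listed above can be sketched as a simple redshift-dependent selection. Using the central wavelength of each filter and a K-band limit of 2.2 micron is a simplification assumed here; the filter dictionary and function name are illustrative only.

```python
K_BAND_UM = 2.2  # approximate central wavelength of the K band in micron


def usable_filters(filter_wavelengths_um, z, limit_um=K_BAND_UM):
    """Names of filters whose rest-frame wavelength is at or below the limit."""
    return [name for name, lam in filter_wavelengths_um.items()
            if lam / (1.0 + z) <= limit_um]


# Central wavelengths in micron; the four IRAC channels are at 3.6 to 8.0 micron.
filters = {"U": 0.36, "K": 2.2, "IRAC1": 3.6,
           "IRAC2": 4.5, "IRAC3": 5.8, "IRAC4": 8.0}

at_low_z = usable_filters(filters, 0.1)
at_high_z = usable_filters(filters, 2.0)
```

At z = 0.1 all four IRAC channels sample wavelengths redward of the rest-frame K band and are dropped, whereas at z = 2 only the 8.0 micron channel is excluded.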



2.5. GOODZ (GO-t) 

The GOODZ code (Dahlen et al. 2010, in preparation) is a further
developed version of the code used by Dahlen et al. (2005, 2007) to
calculate photometric redshifts in the GOODS-S. The code is
based on the template fitting method and allows the inclusion of
Bayesian priors based on the expected shape of the galaxy luminosity
function. Similar to this investigation, GOODZ uses the
four empirical templates from Coleman et al. (1980) and two
templates from Kinney et al. (1996; their templates SB2 and
SB3). The code also uses available spectroscopic redshifts to
correct for offsets between fluxes extracted in different filters or
instruments. Such offsets may be significant when combining
data from different instruments with varying PSF or pixel scales
and may, if uncorrected, lead to increased scatter or biases in the
photometric redshifts. The spectroscopic redshifts are also used
to adjust the input set of template SEDs using a method similar
to Ilbert et al. (2006).

- Templates: GOODZ is only run on PHAT0 so that no individual 
template set is associated with this code. 

- Prior: No prior was used. 

- Training: No training with the model-z's was performed. 



2.6. Hyperz (HY-t) 

Hyperz is a publicly available code based on SED template fitting
using a standard χ² minimisation method. The code uses
the observed fluxes of an object in a set of given filters and compares
them with the theoretical fluxes of galaxies in the same
filters obtained from template spectra, either synthetic or empirical,
taking into account the observational uncertainties as well as
additional effects such as reddening or
IGM opacity. It computes not only a best-fit solution that minimises
the differences, and hence a most probable photometric
redshift, but also a full probability function as a function of redshift.
The code and the method have been tested and described
extensively in Bolzonella et al. (2000), and a further practical
description can be found in its user's manual. Hyperz comes with a
given set of templates, filters, reddening laws, and Lyman forest
modelling but can be easily adapted to use any kind of parameters
that fit the needs of the user. Its simplicity has led
Hyperz to be extensively used and tested since its launch, and
even to be used beyond the pure computation of photometric
redshifts.

- Templates: Hyperz comes with two standard template sets,
one based on the synthetic stellar population library of
Bruzual & Charlot (1993) and the other one consisting of
the four empirical templates from Coleman et al. (1980). For
the PHAT1 test, the latter empirical library was chosen and it
was supplemented with two starburst templates from Kinney
et al. (1996) (templates from both libraries include emission
lines). This set of six basic templates was further enlarged by
applying different amounts of extinction to the templates according
to the Calzetti et al. (2000) dust extinction law.

- Prior: No prior was used.

- Training: No training with the model-z's/spec-z's was performed.
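The χ² template fit and the resulting probability function described for Hyperz can be sketched generically as follows. The analytic solution for the template normalisation and the conversion P(z) ∝ exp(-χ²/2) are common conventions assumed here, not details taken from the Hyperz source; reddening and IGM opacity are left out, and all names and numbers are hypothetical.

```python
import math


def chi2_flux(obs_flux, obs_err, template_flux):
    """chi^2 after solving analytically for the best template normalisation."""
    num = sum(f * t / e ** 2 for f, t, e in zip(obs_flux, template_flux, obs_err))
    den = sum(t ** 2 / e ** 2 for t, e in zip(template_flux, obs_err))
    scale = num / den
    return sum(((f - scale * t) / e) ** 2
               for f, t, e in zip(obs_flux, template_flux, obs_err))


def redshift_pdf(obs_flux, obs_err, model_fluxes):
    """Best grid index and normalised P(z) from the minimum chi^2 per redshift.

    model_fluxes[i] is the list of template flux vectors at grid redshift i.
    """
    chi2_z = [min(chi2_flux(obs_flux, obs_err, t) for t in per_z)
              for per_z in model_fluxes]
    chi2_min = min(chi2_z)
    # Subtracting chi2_min before exponentiating avoids numerical underflow.
    p = [math.exp(-0.5 * (c - chi2_min)) for c in chi2_z]
    total = sum(p)
    return chi2_z.index(chi2_min), [x / total for x in p]


obs = [1.0, 2.0, 3.0]
err = [0.1, 0.1, 0.1]
# Hypothetical template fluxes at three grid redshifts.
models = [[[1.0, 1.0, 1.0]],
          [[2.0, 4.0, 6.0], [1.0, 1.0, 2.0]],
          [[3.0, 2.0, 1.0]]]
best_index, pdf = redshift_pdf(obs, err, models)
```

Here the second grid redshift contains a template that matches the observed fluxes up to a scale factor, so it is returned as the best fit and carries almost all of the probability.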



2.7. Kernelz (KR-t) 

This method is a hybrid incorporating aspects of both template- 
based and empirical codes, though it is most similar in design to 
BPZ and other Bayesian methods. As in standard template-based
codes, model colours are computed for a set of galaxy SEDs at
a set of fixed redshifts. However, the points of this grid of colours
are then treated as if they were individual galaxies. For each test galaxy
the points are weighted by a factor that is akin to a Bayesian 
prior, accounting for the expected probability of seeing such a 
galaxy given the apparent magnitude and type of the test point. 
Redshifts are then estimated using kernel regression, construct- 
ing a weighted average redshift, with weights proportional to 
their proximity to the template points in colour space. The ker- 
nel bandwidth is chosen by cross-validation using the training 
set of galaxies with known redshifts. Results presented here rep- 
resent code that is still in development. Details of the kernel re- 
gression method for both empirical and hybrid techniques will 
be described in Schmidt & Brewer (in prep). A promising ex- 
tension that improves the method by allowing for data adaptive 
kernels will be described in Udaltsova & Schmidt (in prep). A 
public release of the code is also in the works. 
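The kernel regression step described above can be sketched as a prior-weighted average over the template grid points. The Gaussian kernel, the Euclidean distance in colour space, and every name below are assumptions made for illustration, since the Kernelz code itself is not described in more detail here.

```python
import math


def kernel_redshift(test_colours, grid_colours, grid_redshifts,
                    prior_weights, bandwidth):
    """Weighted-average redshift using a Gaussian kernel in colour space."""
    num = 0.0
    den = 0.0
    for colours, z, prior in zip(grid_colours, grid_redshifts, prior_weights):
        dist2 = sum((a - b) ** 2 for a, b in zip(test_colours, colours))
        # Kernel weight times the prior-like weight of this grid point.
        weight = prior * math.exp(-0.5 * dist2 / bandwidth ** 2)
        num += weight * z
        den += weight
    return num / den


# Two hypothetical grid points in a two-colour space.
grid_colours = [[0.0, 0.0], [1.0, 1.0]]
grid_redshifts = [0.2, 1.0]

midway = kernel_redshift([0.5, 0.5], grid_colours, grid_redshifts, [1.0, 1.0], 0.5)
near_first = kernel_redshift([0.0, 0.0], grid_colours, grid_redshifts, [1.0, 1.0], 0.5)
```

A test galaxy midway between the two grid points receives the mean redshift of 0.6, while one coinciding with the first point is pulled close to 0.2. In practice the bandwidth would be chosen by cross-validation on the training set, as stated in the text.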

- Templates: Because Kernelz was still in development when
the results were submitted, simple templates from Coleman
et al. (1980) and Kinney et al. (1996) (both of which include
emission lines) with some extrapolation to IRAC wavelengths
were used.

- Prior: An empirical prior trained on data from VVDS was
used. In practice, this is very similar to the prior described in
Ilbert et al. (2006).

- Training: The spectroscopic data were used only to choose the
kernel bandwidth; no tweaking of templates or zero points
was performed.

2.8. Le Phare (LP-t) 

The public code Le Phare (Arnouts et al. 2002; Ilbert et al.
2006) is primarily dedicated to estimating photo-z's, but it can also
be used to estimate physical parameters like stellar masses and
infrared luminosities. Le Phare is based on a standard template
fitting procedure. The templates are redshifted and integrated
through the instrumental transmission curves. The opacity of the
IGM is taken into account and internal extinction can be added
as a free parameter to each galaxy. The photo-z's are obtained by
comparing the modelled fluxes and the observed fluxes with a χ²
merit function. A probability distribution function is associated
with each photo-z.
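The step of redshifting a template and integrating it through a transmission curve can be sketched as below. This is a deliberately simplified illustration and not Le Phare's algorithm: it uses linear interpolation and the trapezoidal rule, and it ignores cosmological dimming factors, photon-counting weights, IGM opacity, and extinction. All names and the toy template are hypothetical.

```python
def interp(x, xs, ys):
    """Linear interpolation on a sorted grid; returns 0 outside the grid."""
    if x < xs[0] or x > xs[-1]:
        return 0.0
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    return ys[-1]


def synthetic_flux(template_wave, template_flux, filter_wave, filter_trans, z):
    """Transmission-weighted mean flux of a redshifted template in one filter."""
    # Observed wavelength lam corresponds to rest-frame lam / (1 + z).
    shifted = [interp(lam / (1.0 + z), template_wave, template_flux)
               for lam in filter_wave]
    num = 0.0
    den = 0.0
    for i in range(len(filter_wave) - 1):  # trapezoidal rule
        dlam = filter_wave[i + 1] - filter_wave[i]
        num += 0.5 * dlam * (shifted[i] * filter_trans[i]
                             + shifted[i + 1] * filter_trans[i + 1])
        den += 0.5 * dlam * (filter_trans[i] + filter_trans[i + 1])
    return num / den


# Toy template with a sharp break at 4000 Angstrom rest frame.
t_wave = [1000.0, 3999.0, 4000.0, 10000.0]
t_flux = [0.0, 0.0, 1.0, 1.0]
# Top-hat filter covering 5000 to 6000 Angstrom.
f_wave = [5000.0, 5500.0, 6000.0]
f_trans = [1.0, 1.0, 1.0]

flux_z0 = synthetic_flux(t_wave, t_flux, f_wave, f_trans, z=0.0)
flux_z1 = synthetic_flux(t_wave, t_flux, f_wave, f_trans, z=1.0)
```

At z = 0 the filter samples the template redward of the break, while at z = 1 it samples rest-frame 2500 to 3000 Angstrom, blueward of the break, so the synthetic flux drops to zero. Colour changes of this kind are what template-fitting codes exploit.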

For the PHAT1 sample, we adopted a configuration similar
to the one used in the COSMOS field (Ilbert et al. 2009):

- Templates: The set of templates was generated by Polletta
et al. (2007) with the code GRASIL (Silva et al. 1998).
The 9 galaxy templates of Polletta et al. (2007) include 3
SEDs of elliptical galaxies and 6 templates of spiral galaxies
(S0, Sa, Sb, Sc, Sd, Sdm). Those were complemented
with 12 additional blue templates generated with Bruzual &
Charlot (2003). Four different dust extinction laws were applied
(Prevot et al. 1984; Calzetti et al. 2000 and an additional
bump at 2175 Å), depending on the considered template.
Emission lines were added to the templates using relations
between the UV continuum, the star formation rate and
the emission line fluxes (Kennicutt 1998).

- Prior: No prior on the redshift distribution was applied.
However, no redshift solution which would produce a galaxy
brighter than M(B) = -24 was allowed. Such a prior would
create catastrophic failures for some QSOs, but this run was not
explicitly intended to estimate photo-z's for QSOs (no AGN
templates were included in this run), although the PHAT1
catalogue contains some (see below).

- Training: An automatic calibration of the zero-points was
performed using the spec-z sample. The calibration is obtained
by comparing the observed and modelled fluxes
(Ilbert et al. 2006). The calibration is done iteratively until
convergence in the zero-point values is reached. This step
helps in removing bias.

2.9. LRT (LR-t)

LRT (Low-Resolution Spectral Templates; Assef et al. 2008,
2010) is a set of subroutines intended for estimating K-corrections
and photometric redshifts using a basis of empirical
low-resolution SED templates (hence LRT) for galaxies and
AGNs. In this basis, every galaxy is represented by a non-negative
linear combination of three empirically determined
SED templates that resemble an elliptical, an Sbc spiral and an
Im irregular galaxy. Given the nature of the tests in the PHAT
initiative, the AGN SED template was not used. For the PHAT0
testing phase, the LRT subroutines were modified to do a simple
χ² minimisation to fit each template to the data separately rather
than fitting a non-negative combination of them.

- Templates: The templates were derived from the extensive
broad-band and spectroscopic observations of the NOAO
Deep Wide-Field Survey (Jannuzi & Dey 1999) Boötes field
and range in wavelength between 0.03 and 30 μm. In the
PHAT1 testing phase, the LRT subroutines were used with
the SED templates derived in Assef et al. (2008), which have
a shorter wavelength range (0.1-10 μm) than the newer versions
presented in Assef et al. (2010). These newer SED
templates also integrate an AGN component with variable
extinction.

- Prior: For estimating photometric redshifts, the LRT subroutines
also use a simple luminosity function prior, which is by
default based on the R-band luminosity function of Lin et al.
(1996).

- Training: No training with the model-z's/spec-z's was performed.

2.10. Purger (Template Repair) (PT-t)

Originating from the template-based method described in Csabai
et al. (2003), this method uses synthetic colours calculated from
the given spectral energy distribution templates. A common approach
for template fitting is to take a small number of spectral
templates T and choose the best fit by optimising the likelihood
of the fit as a function of redshift, type, and luminosity, p(z, T, L).
Here a variant of this method is used that incorporates a continuous
distribution of spectral templates, enabling the error function
in redshift and type to be well defined.

- Templates: This code is only run on PHAT0 so that no individual
template set is associated with this code.

- Prior: No prior was used. 

- Training: No training with the model-z's was performed. 



2.11. ZEBRA (ZE-t & ZE2-t) 

ZEBRA (Zurich Extragalactic Bayesian Redshift Analyzer;
Feldmann et al. 2006) is a freely available, open source photometric
redshift code based on a SED template-fitting approach.
Built on top of a traditional Maximum Likelihood ansatz, it
introduces and combines several novel methods that help to improve
the accuracy of photometric redshift estimates for galaxies
and AGNs (see e.g. Oesch et al. 2010; Luo et al. 2010 for
some recent applications). First, ZEBRA is able to detect and correct
photometric offsets in the input catalogue. Second, ZEBRA
can use spectroscopic redshifts on a small fraction of the photometric
sample to iteratively correct the original set of input
templates. This template correction step has been shown to be
a crucial ingredient in decreasing the bias, the scatter, and the
number of outliers in the redshift estimation (e.g. Feldmann et al.
2006; Mobasher et al. 2007). Third, when run in Bayesian mode
ZEBRA computes the prior in redshift-template space in a
self-consistent manner from the input catalogues and the
redshift-template likelihood functions. This prior is subsequently used to
derive the posterior probability distribution of each input object.
Here, since ZEBRA participates only in PHAT0, it is run in its basic
Maximum Likelihood mode and with the provided templates.
The following set of parameters is used. The redshifts are allowed
to vary in steps of 0.002 from 0 to 4. The filter bands
are mildly smoothed using a top-hat filter with FWHM of 20 Å.
Finally, the spectral flux densities weighted with photon energy,
not photon counts, are computed using the -flux-type=1 option.
For the ZE2-t runs the redshift stepping is reduced to 0.001 and
no smoothing of the filter bands is performed.

- Templates: ZEBRA is only run on PHAT0 so that no individual
template set is associated with this code.

- Prior: No prior was used.

- Training: No training with the model-z's was performed.

2.12. ANNz (AN-e)



ANNz (Collister & Lahav 2004) is an empirical photo-z code
based on artificial neural networks. Such a network is made up
of several layers, each consisting of a number of nodes. The first
layer receives the galaxy magnitudes as inputs, while the last
layer outputs the estimated photometric redshift. The layers in
between can consist of any number of nodes each. The nodes
are inter-connected, and every connection carries a 'weight',
which is a free parameter in the parametrisation. When a network
is trained, the weights of all node connections are determined
by minimising a cost function E. To avoid over-fitting,
every network is tested on a validation set of galaxies, whose
spectroscopic redshifts are also known. The network with the lowest
value of E as calculated on the validation set is selected and
the photometric sample is run through it for redshift estimation.
An error bar is assigned to each photo-z via a chain rule (see
Collister & Lahav 2004 for details). Neural networks have been
used e.g. for estimation of photo-z's for the SDSS (Collister et al.
2007; Oyaizu et al. 2008; Abdalla et al. 2008b), as well as forecasts
of photometric redshifts for future surveys like the Dark
Energy Survey (Banerji et al. 2008) and Euclid (Abdalla et al.
2008a).

A neural network architecture of N:2N:2N:1 was used for the
PHAT tests, where N is the number of filters for which there are
input magnitudes. Different architectures were tested, but this
did not lead to any substantial improvement in the results. The
choice of architecture is fully justified by tests done in Firth et al.
(2003) and Collister & Lahav (2004).
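To make the N:2N:2N:1 architecture concrete, the sketch below builds the layer sizes, counts the free parameters, and runs one forward pass. It is not ANNz code: the bias term per node, the sigmoid activation in the hidden layers, the linear output node, and the random initial weights are assumptions chosen for illustration.

```python
import math
import random


def layer_sizes(n_filters):
    """Node counts for an N:2N:2N:1 architecture."""
    return [n_filters, 2 * n_filters, 2 * n_filters, 1]


def n_weights(sizes, with_bias=True):
    """Number of free parameters between consecutive, fully connected layers."""
    extra = 1 if with_bias else 0
    return sum((a + extra) * b for a, b in zip(sizes, sizes[1:]))


def init_weights(sizes, seed=0):
    """Random weights; each node stores its input weights followed by a bias."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(a + 1)] for _ in range(b)]
            for a, b in zip(sizes, sizes[1:])]


def forward(magnitudes, weights):
    """Propagate magnitudes through the network; the output node is linear."""
    activations = list(magnitudes)
    for depth, layer in enumerate(weights):
        is_last = depth == len(weights) - 1
        outputs = []
        for node in layer:
            s = node[-1] + sum(w * a for w, a in zip(node[:-1], activations))
            outputs.append(s if is_last else 1.0 / (1.0 + math.exp(-s)))
        activations = outputs
    return activations[0]


sizes = layer_sizes(18)          # 18 bands, as in the PHAT1 catalogue
weights = init_weights(sizes)
z_untrained = forward([0.5] * 18, weights)
```

For 18 input magnitudes this gives an 18:36:36:1 network with 2053 free parameters when bias terms are counted, or 1980 without them. Training would adjust these weights to minimise the cost function E; the untrained output above has no physical meaning.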

2.13. BDT(DT-e) 



The Boosted Decision Tree (BDT) algorithm ( Gerdes et al.|2 010) 
is a training-set-based method that combines an ensemble of 
weak classifiers into a single, powerful classifier. The spectro- 
scopic training set is first divided into redshift bins whose width 
is approximately half the expected photo-z resolution of the al- 
gorithm for the given sample. We have found that a finer binning 
choice does not improve the resolution. For each bin, a set of 
trees is trained to recognise as "signal" those galaxies whose
redshift falls within the bin in question, and as "background"
those that fall more than 2σ away from the signal bin, where σ is
the iteratively determined photo-z resolution. As training
variables we use the observed magnitudes in each band. The
process of constructing an individual tree begins with a root node
containing all the training galaxies. The root node is then split 
into two subsamples by placing a cut on the one variable that 
best separates the sample into signal and background. Each new 
node is subsequently split in this way until the nodes reach a 
certain minimum size. The result is a tree containing nodes with 
predominantly signal and predominantly background galaxies. 
The process of "boosting" iteratively repeats this procedure,
giving higher weight to galaxies that were initially
misclassified. The overall signal probability of a galaxy is then
obtained by combining the classification output from
approximately 50 trees in each photo-z bin, where higher weight
is given to trees with lower misclassification rates in the
training set.
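
The signal/background definition for one redshift bin can be sketched as follows. This only illustrates the labelling step, not the tree training or the boosting; the function name, the numerical values, and the choice to leave galaxies within 2σ of the bin out of the training for that bin are assumptions.

```python
def label_for_bin(z_spec, bin_lo, bin_hi, sigma):
    """Return 'signal', 'background' or None (unused) for one redshift bin."""
    if bin_lo <= z_spec < bin_hi:
        return "signal"
    if z_spec < bin_lo - 2.0 * sigma or z_spec >= bin_hi + 2.0 * sigma:
        return "background"
    return None  # closer than 2 sigma to the bin: not used for this bin

sigma = 0.02           # hypothetical photo-z resolution
bin_width = sigma / 2  # about half the expected resolution
labels = [label_for_bin(z, 0.50, 0.50 + bin_width, sigma)
          for z in (0.505, 0.52, 0.60, 0.30)]
```

Here the galaxy at z = 0.505 is signal, the one at z = 0.52 is ignored, and the two at z = 0.60 and z = 0.30 are background for the bin starting at z = 0.50.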

The method produces a photo-z probability for each galaxy 
as a function of redshift. This method therefore yields not only 
an estimate of the best photo-z and error, but a reconstruction of 
the full redshift PDF, P(z). In Gerdes et al. (2010) it was shown
that the BDT algorithm improves upon the default photo-z's in 
the SDSS spectroscopic sample, and that the PDFs yield a more 
accurate reconstruction of the redshift distribution N(z). 



2.14. Wolf (Empirical χ²) (EC-e)



The method of Wolf (2009) derives PDFs from empirical models and
is a subclass of kernel regression methods. It mimics a
template-based χ²-technique with the main difference that an
empirical dataset is used in place of the template grid. Each
object in the empirical set contributes to the observed object
with a quantified probability. The PDF of redshifts thus obtained
can be used in its entirety or investigated for ambiguities.
Here, it is just reduced to an expectation value and RMS in
redshift. Any kernel approach requires the choice of a kernel
function, which also acts as a smoothing scale for the discrete
empirical model grid. Here, we used a Gaussian kernel function
with σ_m = 0.1 mag. However, a χ²-method is correctly implemented
if the kernel function applied to the model makes its density
distribution match that of the observed sample (see the matched
error scale in Sect. 6 of Wolf 2009 for details). As a
consequence, redshift distributions of object samples can
potentially be reconstructed accurately to within the Poisson
noise of the sample sizes, which would also imply no bias
exceeding random noise.
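
A minimal sketch of the kernel idea, assuming that each empirical object is weighted by a Gaussian kernel in magnitude space and that the resulting discrete PDF is reduced to its expectation value and RMS. The function name, the use of raw magnitudes, and the toy data are assumptions for illustration; the actual code of Wolf (2009) may differ in detail.

```python
import math

def kernel_redshift(obs_mags, model_mags, model_z, sigma_m=0.1):
    """Expectation value and RMS of z from Gaussian-kernel weights."""
    weights = []
    for mags in model_mags:
        chi2 = sum(((o - m) / sigma_m) ** 2 for o, m in zip(obs_mags, mags))
        weights.append(math.exp(-0.5 * chi2))
    norm = sum(weights)  # zero if no empirical object is anywhere near
    z_mean = sum(w * z for w, z in zip(weights, model_z)) / norm
    z_var = sum(w * (z - z_mean) ** 2 for w, z in zip(weights, model_z)) / norm
    return z_mean, math.sqrt(z_var)

# three hypothetical empirical objects with two magnitudes each
empirical_mags = [(20.0, 19.5), (20.05, 19.55), (22.0, 21.0)]
empirical_z = [0.3, 0.4, 1.2]
z_mean, z_rms = kernel_redshift((20.0, 19.5), empirical_mags, empirical_z)
```

The distant third object receives a negligible weight, so the estimate lies between the redshifts of the two nearby objects.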

2.15. Purger (Nearest-Neighbour Fit) (PN-e) 

This empirical method compares the observed colours to the ref- 
erence set. The estimation method first searches the colour space 
for the k nearest neighbours of every object in the estimation set 
(i.e. the galaxies for which we want to estimate redshift) and then 
estimates the redshift by fitting a local low order polynomial to 
these points. An improved version of this code uses a k-d tree
index for fast nearest neighbour search (Csabai et al. 2007). It
was used to calculate photometric redshifts for the SDSS Data
Release 7 (Abazajian et al. 2009). The advantage of this method
versus a template-based method might be the better estimation 
accuracy, but it cannot extrapolate, so the completeness of the 
reference set is crucial. For this reason, we have used the large 
training set available for the PHAT0 test. 

The estimation was done on the large, simulated data set with
150 nearest neighbours. A small number of outliers was
automatically excluded from the regression on the neighbour 
sets. 
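
The two steps of the method (neighbour search, local polynomial fit) can be sketched as follows. This is a simplified illustration under several assumptions: a brute-force search replaces the k-d tree, the local polynomial is first order, no outlier rejection is applied, and all function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def knn_photoz(colours, ref_colours, ref_z, k=150):
    """Fit z = c0 + sum_i c_i * colour_i to the k nearest reference objects."""
    order = sorted(
        range(len(ref_z)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(colours, ref_colours[j])))
    nearest = order[:k]
    rows = [[1.0] + list(ref_colours[j]) for j in nearest]
    dim = len(rows[0])
    # normal equations (X^T X) c = X^T z of the least-squares problem
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(dim)]
           for p in range(dim)]
    xtz = [sum(r[p] * ref_z[j] for r, j in zip(rows, nearest))
           for p in range(dim)]
    coeff = solve_linear(xtx, xtz)
    return coeff[0] + sum(c * x for c, x in zip(coeff[1:], colours))

# toy reference set in which z is exactly linear in two colours
ref_colours = [(0.1 * i, 0.1 * j) for i in range(10) for j in range(10)]
ref_z = [0.1 + 0.5 * c1 + 0.2 * c2 for c1, c2 in ref_colours]
z_est = knn_photoz((0.42, 0.37), ref_colours, ref_z, k=20)
```

Because the fit is purely local, an object outside the colour range of the reference set would be extrapolated from its nearest neighbours only, which is why the completeness of the reference set matters.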



2.16. Li (Polynomial) (PO-e) 



This empirical photo-z method is based on Li & Yee (2008), which
uses a polynomial fit so that the galaxy redshift is expressed as
the sum of its magnitudes and colours. In contrast to Li & Yee
(2008), where the training set galaxies are divided into several
fixed colour-magnitude cells, here the coefficients of the
photo-z polynomial are derived individually for each galaxy by 
choosing a subset of training set galaxies whose magnitudes and 
colours are closest to the input galaxy. They are chosen based on 
quadratically summed ranks of colour and magnitude differences 
between the training set galaxies and the input galaxy. All mag- 
nitudes and independent colours are used. Note that each train- 
ing set galaxy has an equal weight in the fit. This may introduce 
a redshift bias to input galaxies near the edges of the colour- 
magnitude distributions. Therefore, a better approach would be 
to assign weights to the chosen training-set galaxies based on 
the inverse value of their final rank, but this has not been imple- 
mented for PHAT. 
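
The selection of training galaxies by quadratically summed ranks can be sketched as follows. The function name and the toy data are hypothetical, and the handling of ties is an implementation detail not specified in the text.

```python
def closest_by_rank(target, training, n_select):
    """Indices of the n_select training galaxies with the smallest
    quadratically summed ranks of per-feature differences."""
    n_train = len(training)
    score = [0.0] * n_train
    for f in range(len(target)):
        # rank the training galaxies by their distance in this one feature
        by_diff = sorted(range(n_train),
                         key=lambda j: abs(training[j][f] - target[f]))
        for rank, j in enumerate(by_diff):
            score[j] += rank ** 2
    return sorted(range(n_train), key=lambda j: score[j])[:n_select]

# features: (magnitude, colour) for four hypothetical training galaxies
training = [(20.0, 0.5), (21.0, 0.6), (20.1, 1.5), (23.0, 0.7)]
chosen = closest_by_rank((20.02, 0.52), training, n_select=2)
```

Using ranks rather than raw differences makes the selection independent of the different dynamic ranges of magnitudes and colours.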



2.17. Carliles (Regression Trees) (RT-e)



The RT-e method by Carliles et al. (2010) is based on Random
Forests which are an empirical, non-parametric regression tech- 
nique. A Random Forest builds an ensemble average of ran- 
domised regression tree redshift estimates. Bootstrap samples 
are created by sampling from the training set with replacement, 
and each regression tree is trained on its own bootstrap sample. 
Given a new test object, each regression tree produces its own 
redshift estimate, and these estimates are averaged to yield the fi- 
nal Random Forest redshift estimate. This technique also results 
in Gaussian errors, and this behaviour has a strong theoretical
statistical explanation. Intuitively speaking, a given new galaxy
can be considered to be drawn from the space of inputs (colours, 
magnitudes, etc.) by redshifts. This space is the event space, and 
for that new galaxy one can hypothesise the existence of a distri- 
bution over the event space, unique to that galaxy, which reflects 
the similarity of the new galaxy (minus the unknown redshift) 
to any given point in the event space. The Random Forest ap- 
proximates this distribution per object, and the process results in 
easily computable per-object error parameter estimates. 

For the PHAT tests a leaf size of 5 was chosen and 50 trees 
were used. 



2.18. Singal (Neural Network) (SN-e) 

The primary motivation for the development of this code was to 
treat additional available galaxy information beyond photomet- 
ric data, for example shape parameters, on an equal footing with 
the photometric data (as was done in e.g. Collister & Lahav 2004;
Ball et al. 2004). The package, although still undergoing
modification, is a multi-layer perceptron neural network for the 
IDL environment. The IDL code can be relatively easily modi- 
fied, and could in principle be optimised for a variety of input 
data situations. As training convergence is relatively slow in this 
network, it is most useful in situations where a robust training 
set is available from the outset. 

As implemented here, the network has an input layer of neurons
which accepts the magnitudes in each band. The input layer treats
all input information on an equal footing, normalising across all
objects in the training set so that the inputs for each neuron on
the input layer are distributed between 0 and 1. There are two
hidden layers of 30 neurons each, and an output layer with a
single neuron obtaining a value between 0 and 1 which is a proxy
for the estimated redshift, with the linear conversion defined
during the training when the known redshifts of the training set
are supplied subject to the conversion.
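
The scaling to the unit interval and the linear conversion back to redshift can be sketched as follows. A simple min-max scaling defined on the training set is assumed here; the function names and values are hypothetical and not taken from the IDL package.

```python
def fit_scaling(values):
    """Offset and span of a min-max scaling defined on the training set."""
    lo, hi = min(values), max(values)
    return lo, hi - lo

def to_unit(value, lo, span):
    """Map a training-set quantity onto the interval [0, 1]."""
    return (value - lo) / span

def from_unit(unit, lo, span):
    """Convert a network output in [0, 1] back to a redshift."""
    return lo + unit * span

train_z = [0.1, 0.6, 1.1]                # hypothetical training redshifts
z_lo, z_span = fit_scaling(train_z)
target = to_unit(0.6, z_lo, z_span)      # value the output neuron is trained on
z_back = from_unit(target, z_lo, z_span) # conversion applied to network output
```

The same kind of scaling would be applied per band to the input magnitudes, so that every input neuron sees values between 0 and 1.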



3. PHAT0 - a highly idealised simulation

3.1. Motivation 

The lowest algorithmic level of the codes can be tested if the 
photometry is bias-free and everything except for the redshifts 
is provided. In this way the choice of template sets, the use of 
priors, etc. do not play a role and code-specific problems can 
be disentangled from other effects. To this end, simulations with 
synthetic photometry are set up with the LP-t photo-z code (see
Sect. 2).




Fig. 1. Template set used for the PHAT0 test (arbitrary flux
normalisation).



3.2. Data set

In order to keep things simple PHAT0 is based on a very limited
template set and a long wavelength baseline. A noise-free
catalogue with accurate synthetic colours is provided as well as
a catalogue with a low level of additional noise. Furthermore,
we added a very large training set to ensure that also empirical
photo-z algorithms find an ideal environment. The ingredients
are detailed in the following.

Everything but the redshifts for the test data set was revealed
to the participants. In particular, the template set (Sect.
3.2.1) and the filter curves (Sect. 3.2.2) were provided, and
details about the construction of the catalogues (Sects. 3.2.3 &
3.2.4, e.g. the used IGM recipe) were revealed. The participants
were explicitly asked to use those ingredients if applicable to
make their setup as comparable to the simulation setup as
possible.

3.2.1. Template set

The empirical template set by Coleman et al. (1980) has been
used extensively in different photo-z studies. As in the case of
LP-t (Ilbert et al. 2006) and BP-t (Benítez 2000) we decided to
supplement this template set by two templates for starburst
galaxies from Kinney et al. (1996). The template SEDs are
displayed in Fig. 1.

It should be noted that the choice of the template set is not
critical in this test because the template set is provided to
the participants using template-based codes and the very large
training set (see below) covers densely the whole SED-redshift
space. This particular set is chosen here because it is one of
the most widely used sets for photo-z's in its original,
extended, and modified (re-calibrated) form. Participants using
template-based codes were explicitly asked to use this
particular template set for the PHAT0 test and switch off any
priors within their codes.



3.2.2. Filter set 

For the PHAT0 test we want to avoid systematic effects that can
arise in photo-z's because of an insufficient coverage in
wavelength. For example, colour-redshift degeneracies (see e.g.
Benítez 2000) can occur between high and low redshift if
infrared (IR) and/or ultraviolet (UV) bands are not available.

Thus, the filter set used here spans the whole range from
near-UV to mid-IR (see Fig. 2). We choose the ugriz-bands from
MEGACAM mounted at the CFHT (Boulade et al. 2003), the
YJHK-bands of UKIDSS (Lawrence et al. 2007), and the two bluer
bands of the IRAC camera mounted on the Spitzer Space Telescope
(Fazio et al. 2004). Again this choice is not too critical
since the filter curves are provided and one of the tests does not 
include any noise at all and the other one includes just a low level 
of noise in the photometry.

Table 1. Methods used for photo-z estimation within PHAT

Acronym  Participant               Code                                                    Reference
BP-t     Coe, D.                   BPZ, Bayesian Photometric Redshifts                     Benítez (2000); Coe et al. (2006)
BP2-t    Benítez, N.               BPZ, Bayesian Photometric Redshifts                     Benítez (2000); Benítez 2010 in prep.
EA-t     Brammer, G.               EAZY, Easy and Accurate Redshifts from Yale             Brammer et al. (2008)
GA-t     Kotulla, R.               GALEV, GALaxy EVolution                                 Kotulla et al. (2009)
GO-t     Dahlen, T.                GOODZ                                                   Dahlen et al. (2005, 2007)
HY-t     Miralles, J.-M.           Hyperz                                                  Bolzonella et al. (2000)
KR-t     Schmidt, S.               Kernelz, Kernel Regression                              Schmidt & Brewer (in prep)
LP-t     Arnouts, S.; Ilbert, O.   Le Phare                                                Ilbert et al. (2006)
LR-t     Assef, R.                 LRT, Low-Resolution Spectral Templates                  Assef et al. (2008, 2010)
PT-t     Purger, N.                Template Repair                                         Adelman-McCarthy et al. (2007)
ZE-t     Feldmann, R.              ZEBRA, Zurich Extragalactic Bayesian Redshift Analyzer  Feldmann et al. (2006)
ZE2-t    Gillis, B.                ZEBRA, Zurich Extragalactic Bayesian Redshift Analyzer  Feldmann et al. (2006)

AN-e     Abdalla, F.; Banerji, M.  ANNz, Artificial Neural Network                         Collister & Lahav (2004)
DT-e     Gerdes, D.                BDT, Boosted Decision Trees                             Gerdes et al. (2010)
EC-e     Wolf, C.                  Empirical χ²                                            Wolf (2009)
PN-e     Purger, N.                Nearest-Neighbour Fit                                   Abazajian et al. (2009)
PO-e     Li, I. H.                 Polynomial Fit                                          Li & Yee (2008)
RT-e     Carliles, S.              Regression Trees                                        Carliles et al. (2010)
SN-e     Singal, J.                Neural Network

Public codes:
BPZ: http://acs.pha.jhu.edu/~txitxo/; version 1.99.3 used for PHAT: http://www.its.caltech.edu/~coe/BPZ/
EAZY: http://www.astro.yale.edu/eazy/
GALEV: http://www.galev.org/
Hyperz: http://webast.ast.obs-mip.fr/hyperz/
Le Phare: http://www.cfht.hawaii.edu/~arnouts/lephare.html
LRT: http://www.astronomy.ohio-state.edu/~rjassef/lrt/
Purger (PhotoZ): http://skyserver.elte.hu/PhotoZ/
ZEBRA: http://www.exp-astro.phys.ethz.ch/ZEBRA/
ANNz: http://www.homepages.ucl.ac.uk/~ucapola/annz.html
Regression Trees: http://www.sdss.jhu.edu/~carliles/photoZ/
Singal (Neural Network): http://www.slac.stanford.edu/~jacks/

3.2.3. Noise-free catalogue

One of the simplest tests one can think of is to compare the
redshift estimates of different codes for data with infinite
signal-to-noise (S/N) and thus perfect colours. In this way the
agreement of the basic interpolation and convolution algorithms
in template-based codes can be tested. Any differences found in
such a basic test will probably propagate to more realistic
setups.

We use the LP-t code as a reference to create such a catalogue
evenly distributed over the six templates and over the redshift
range 0 < z < 4 including the effect of absorption by the
intergalactic medium (IGM) following the recipe by Madau (1995).
The model redshifts were revealed to the participants for this
test.

It should be noted that inaccurate redshift estimates from one 
of the codes only mean that this particular code does not agree 
perfectly with LP-t. Which of the two codes is inaccurate (or 
whether even both are inaccurate) cannot be decided with such a 
test. 

3.2.4. Catalogue with noise 

To study the influence of noise on the results, a more realistic
catalogue is set up as well. We adopt a parametric form for the
signal-to-noise as a function of magnitude which behaves as a
power-law at bright magnitudes and an exponential at faint
magnitudes. The transition regime is defined by the parameters
(m*, err*). At magnitude m < m*, we adopt

$\mathrm{err}(m) = \mathrm{err}_* \, 10^{0.4\,(\alpha_\mathrm{bright}+1)\,(m-m_*)}$,

and at magnitude m > m*, we use

$\mathrm{err}(m) = \frac{\mathrm{err}_*}{e} \, \exp\left(10^{\alpha_\mathrm{faint}\,(m-m_*)}\right)$,

where α_bright and α_faint are the slopes at bright and faint
magnitudes, respectively. The adopted values for each filter are
reported in Table 2, while the behaviour of the signal-to-noise
(S/N = 1.086/err) for the different passbands is shown in Fig. 3
(colour coded from the u band in cyan to 4.5 μm in red). The
noisy magnitudes are randomly drawn assuming a Gaussian
distribution in flux with mean and standard deviation
(flux, err(flux)).
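
The noise recipe can be sketched as follows. The prefactors err* and err*/e are assumed here such that both branches meet at (m*, err*); the function names and the arbitrary flux zero-point are choices made for this example, and the u-band parameters are those listed in Table 2.

```python
import math
import random

def mag_error(m, m_star, err_star, alpha_bright, alpha_faint):
    """Magnitude error: power law at bright, exponential at faint magnitudes."""
    if m < m_star:
        return err_star * 10.0 ** (0.4 * (alpha_bright + 1.0) * (m - m_star))
    return err_star / math.e * math.exp(10.0 ** (alpha_faint * (m - m_star)))

def noisy_magnitude(m, err, rng):
    """Draw a noisy magnitude from a Gaussian in flux (arbitrary zero-point)."""
    flux = 10.0 ** (-0.4 * m)
    # S/N = flux / sigma_flux = 1.086 / err
    sigma_flux = flux * err / 1.086
    noisy_flux = rng.gauss(flux, sigma_flux)
    # a non-positive flux has no magnitude; flag it as infinitely faint
    return -2.5 * math.log10(noisy_flux) if noisy_flux > 0 else float("inf")

u_band = dict(m_star=27.0, err_star=0.2, alpha_bright=-0.25, alpha_faint=0.22)
err_24 = mag_error(24.0, **u_band)
snr_24 = 1.086 / err_24
rng = random.Random(42)
u_noisy = noisy_magnitude(24.0, err_24, rng)
```

For these parameters a u = 24 object has an error of about 0.025 mag, i.e. S/N of roughly 43.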

To generate the simulated catalogue, the galaxies are dis- 
tributed according to r-band luminosity functions for the differ- 
ent spectral types. However, for simplicity in the comparison of 
the different codes, we do not apply any dust attenuation for the 
star-forming galaxies and we do not let the luminosity functions 
evolve with redshift. Thus, this simulated catalogue is not ex- 
pected to provide a realistic distribution of low and high redshift 
galaxies. Note that we do include the averaged Lyman absorption
by the intergalactic medium as a function of redshift, following
Madau (1995), which will affect the blue bands at high redshift.
The catalogue has been cut to objects brighter than r = 24, so
that only reasonably high-S/N sources are included. The redshift
distribution attains a smooth shape with a peak at intermediate
redshifts and few objects beyond z ~ 1.5.

The final catalogue consists of ~11 000 objects for which the
redshifts are not revealed to the participants. Furthermore, a
much larger training set of ~170 000 objects with exactly the
same properties as the original catalogue is provided.

Table 2. Filters used for the PHAT0 test

Filter  Instrument     m*    err*  α_bright  α_faint
u       MEGACAM@CFHT   27.0  0.2   -0.25     0.22

Fig. 2. Transmission curves of the filter set used for the PHAT0
test (horizontal axis: λ [Å]).

Fig. 3. Signal-to-noise model used for the PHAT0 test
(horizontal axis: reference AB magnitude).

3.3. Results for the noise-free case

In the following we will present the results of three different
template-based codes on the noise-free catalogue that were
submitted after the release. The training of empirical codes on
noise-free data often does not make sense. That is probably the
reason why no results for empirical codes on the noise-free data
have been submitted to PHAT.

Fig. 5. Opacity curves used by LP-t (solid) and HY-t and BP-t
(dashed) for a redshift of z = 3.5.

The results are summarised in Fig. 4 showing the model redshift
z_model against the redshift estimate z_phot and the redshift
difference Δz = z_model - z_phot.

The ZE2-t code shows nearly perfect agreement with LP-t 
in this test in terms of redshift estimates. This suggests strongly 
that the basic interpolation of the filter- and template-curves and 
their subsequent convolution by the two codes leads to colour 
estimates that agree very well. Also the modelled attenuation of 
the IGM seems to be identical in both codes. 

Up to a redshift of z ~ 2.5 the agreement between LP-t and 
HY-t/BP-t is close to perfect as well. For higher redshifts there 
are considerable discrepancies between LP-t on the one hand and 
HY-t and BP-t on the other hand. 

A further analysis shows that especially the blue templates with
considerable UV flux get assigned grossly wrong redshift
estimates. At a redshift of z ~ 2.5 the Lyman-α line enters our
filter set. These two facts suggest that the handling of the
IGM, i.e. the opacity of the Lyman-α forest, is implemented
differently in the codes. Although all codes refer to the paper
of Madau (1995), it turns out that HY-t and BP-t use an analytic
approximation of the opacity curve. As described in that paper
the opacity curve can be approximated by a step-function with
depression factors D_A and D_B shortward of Lyman-α and Lyman-β,
respectively, and a complete absorption shortward of the
Lyman-limit. LP-t uses the full opacity curve instead (binned
for redshift intervals of Δz = 0.1). See Fig. 5 for a comparison
of the opacity curves for a redshift of z = 3.5.

Fig. 4. Results of the PHAT0 test for the noise-free catalogue

The scatter around the mean opacity curve for a given redshift
is rather large (see Fig. 3 of Madau 1995) due to clustering of
the IGM. Thus, for practical applications we do not expect
either method to outperform the other as long as a direct
relation between opacity and redshift is assumed. To account for
the greatly varying optical depth of the IGM for different
lines-of-sight at a fixed redshift in a realistic application,
one certainly would have to vary opacity as another free
parameter. The discrepancies reported here just appear in this
artificial test without noise and with a fixed opacity-redshift
relation. However, different residuals between model and
observation might well be present in applications of photo-z
codes with a fixed opacity-redshift relation to real data.
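
The step-function approximation of the IGM transmission can be sketched as follows. The depression factors are passed in as plain numbers because their redshift dependence from Madau (1995) is not reproduced here; the function name and the example values of D_A and D_B are placeholders, not values used by any of the codes.

```python
LYMAN_ALPHA = 1215.67  # rest-frame wavelengths in Angstrom
LYMAN_BETA = 1025.72
LYMAN_LIMIT = 911.75

def igm_transmission_step(lambda_obs, z, d_a, d_b):
    """Step-function IGM transmission at an observed wavelength."""
    lambda_rest = lambda_obs / (1.0 + z)
    if lambda_rest < LYMAN_LIMIT:
        return 0.0   # complete absorption shortward of the Lyman limit
    if lambda_rest < LYMAN_BETA:
        return d_b   # between the Lyman limit and Lyman-beta
    if lambda_rest < LYMAN_ALPHA:
        return d_a   # between Lyman-beta and Lyman-alpha
    return 1.0       # no absorption redward of Lyman-alpha

# placeholder depression factors for a source at z = 3.5
steps = [igm_transmission_step(lam, 3.5, d_a=0.4, d_b=0.2)
         for lam in (6000.0, 5000.0, 4500.0, 4000.0)]
```

A code using the full opacity curve would replace the two constant plateaus by a wavelength-dependent transmission, which is where the template fluxes in the blue bands start to differ between the codes.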



3.4. Results for the catalogue with noise 

We select the best fit or most likely photo-z estimate from each
method. Some methods provide estimates of confidence in their
photo-z's in the form of redshift uncertainties or probability
distributions P(z) and/or template quality of fit measurements
like χ². These can help identify and prune those photo-z
estimates most likely to be outliers. However, these confidence
measures are not provided consistently or universally among the
various methods, so we do not consider them here.

The error distribution of photo-z's is usually non-Gaussian with
extended tails and some catastrophic outliers with grossly wrong
redshift estimates. To summarise this distribution by a few
numbers is not always possible. Here we express the photo-z
accuracy in terms of the mean and the RMS scatter of the
quantity Δz = z_model - z_phot (after rejection of outliers),
and an outlier rate, as was done in many former studies. These
statistics for the different codes can be found in Table 3.
Figure 8 shows the scatter and outlier values in comparison. We
define all objects with a redshift estimate that differs by more
than 0.1 from the model redshift, i.e. |Δz| = |z_model - z_phot|
> 0.1, as outliers. We refer the reader to the diagrams in Figs.
6 & 7 showing the complete error distribution.

Table 3. Results for the PHAT0 catalogue with noise

Acronym  bias    scatter  outlier rate
LP-t      0.000  0.010     0.044%
BP-t     -0.005  0.011     0.026%
EA-t     -0.001  0.012     0.000%
GA-t      0.000  0.014     0.053%
GO-t      0.000  0.012     0.018%
HY-t     -0.002  0.013     0.185%
LR-t      0.000  0.011     0.026%
PT-t     -0.005  0.011     0.053%
ZE-t      0.000  0.011     0.062%
ZE2-t    -0.005  0.011     0.044%

AN-e      0.000  0.011     0.018%
DT-e     -0.004  0.019     0.389%
PN-e      0.000  0.017     0.053%
PO-e      0.001  0.019     1.669%
RT-e      0.000  0.013     0.010%
SN-e     -0.005  0.049    18.202%

Outliers are defined as objects with |Δz| = |z_model - z_phot| > 0.1.

Fig. 6. Results of the PHAT0 test for the catalogue with noise,
z_phot vs. z_model. Note that LP-t (top-left panel) was used to
create the simulations and should be regarded as a reference.
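
The three summary statistics can be computed along the following lines. This is a sketch under stated assumptions: the scatter is taken as the RMS about the mean of the non-outliers, the function name is hypothetical, and the four redshift pairs are made-up toy values.

```python
import math

def photoz_statistics(z_model, z_phot, outlier_cut=0.1):
    """Bias, RMS scatter (both after outlier rejection) and outlier rate."""
    dz = [zm - zp for zm, zp in zip(z_model, z_phot)]
    good = [d for d in dz if abs(d) <= outlier_cut]
    bias = sum(good) / len(good)
    # RMS about the mean of the retained objects (an assumption)
    scatter = math.sqrt(sum((d - bias) ** 2 for d in good) / len(good))
    outlier_rate = 1.0 - len(good) / len(dz)
    return bias, scatter, outlier_rate

z_model = [0.5, 1.0, 1.5, 2.0]
z_phot = [0.51, 0.99, 1.5, 3.0]   # the last object is a catastrophic outlier
bias, scatter, outlier_rate = photoz_statistics(z_model, z_phot)
```

Because outliers are removed before the scatter is computed, a stricter outlier cut lowers the scatter and raises the outlier rate, so the two numbers should be read together.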




3.4.1. Results from LP-t (Arnouts) 

In order to set a standard to which the performance of all other
codes can be compared, we run LP-t on the catalogue with noise
that was created by the code itself. It is reasonable to regard
the accuracy reached by LP-t on this catalogue as a theoretical
limit set by the amount of noise put in (see Sect. 3.2.4). The
results are displayed in the first panels of Figs. 6 & 7
alongside the results from the other codes.



3.4.2. Results from the other codes 

The numbers in Table 3 and the observed error distributions
displayed in Figs. 6 & 7 suggest that most codes tested here
perform similarly to LP-t. Note that there is some degeneracy
between the scatter values and the outlier rates. No significant
bias is produced by any of the codes. All bias values are
smaller than 0.5%. Looking at the scatter values and outlier
rates, four different groups can be identified:



1. A large number of codes (AN-e, BP-t, GO-t, EA-t, LR-t, RT-e,
PT-t, ZE-t, ZE2-t) perform very similarly to LP-t with scatter
values only slightly larger and outlier rates that are very
similar or even smaller. This can be regarded as essentially
identical performance because the low numbers of outliers are
strongly affected by shot-noise. Note that the outlier rates of
these codes correspond to ≲ 7 out of ~11 000 objects!

2. Some other codes (GA-t, HY-t, PN-e) show larger values in
both statistics than LP-t, but the differences are still minor
and not very significant.

3. The codes DT-e and PO-e yield scatter values that are larger
by a factor of two and outlier rates that are much larger than
the LP-t statistics, with DT-e yielding a smaller outlier rate
than PO-e.

4. SN-e performs worse but is still in the development phase.

Fig. 7. Results of the PHAT0 test for the catalogue with noise,
Δz = z_model - z_phot vs. z_model. Note that LP-t (top-left
panel) was used to create the simulations and should be regarded
as a reference.
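
The shot-noise argument for the first group can be made concrete with a short calculation. It assumes simple Poisson statistics on the outlier counts and uses 7 outliers in 11 000 objects as an example.

```python
import math

n_objects = 11000
n_outliers = 7

rate = n_outliers / n_objects
# Poisson uncertainty on a count of n is sqrt(n)
poisson_error = math.sqrt(n_outliers) / n_objects

rate_percent = 100.0 * rate
error_percent = 100.0 * poisson_error
```

This gives an outlier rate of about 0.064% with a Poisson uncertainty of about 0.024%, so differences of a few hundredths of a percent between codes are not significant.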

In the following we discuss the problems occurring in the last 
two groups. 

3.4.3. Problems 

The panels for DT-e of Figs. 6 & 7 clearly show that the code
performs very similarly to the codes from groups 1 & 2 for
redshifts z_model ≲ 1.1. For larger redshifts the training set
becomes more and more sparse. The division into branches of the
decision tree hence becomes less precise. For the highest
redshift interval only one branch is established so that objects
from a rather large range in z_model are all assigned the same
z_phot. This particular feature of the DT-e code leads to the
slightly worse statistics reported in Table 3.

The empirical code PO-e (see Sect. 2.16) is based on a
second-order polynomial fit of the colour-redshift relation.
This leads to a very limited number of degrees of freedom (66 in
the PHAT0 case with 11 bands) compared to the number of objects
in the training set.³ Not all the information included in the
training set can be reflected by the 66 coefficients so that
this empirical code performs worse in this test than other
empirical codes (e.g. AN-e) that feature many more degrees of
freedom.
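
One way to arrive at the number 66 is to count the coefficients of a full second-order polynomial in the 10 independent colours that 11 bands provide. The text does not spell out the counting, so this choice of variables is an assumption, as is the function name.

```python
from math import comb

def n_poly_coefficients(n_variables, order):
    """Number of monomials of total degree <= order in n_variables."""
    return comb(n_variables + order, order)

n_bands = 11
n_colours = n_bands - 1   # independent colours (assumed fit variables)
n_coeff = n_poly_coefficients(n_colours, 2)   # 1 constant + 10 linear + 55 quadratic
```

A few tens of coefficients are few compared to a training set with of order a thousand objects or more, which is the limitation described in the text.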



³ Note that PO-e was trained on a much smaller training set with
~1200 objects.

Fig. 8. Scatter and outlier values for the catalogue with noise
of PHAT0. The inset shows the region in the lower left as a
blow-up, but due to shot noise the performance of most of the
codes in the inset should be regarded as identical.



The SN-e code was developed for a low redshift (z < 1.5) 
dataset with robust colours and galaxy shape information, and is 
not currently optimised for high redshift and/or noisy data that 
is photometric only, as was the case with the PHAT datasets. 
However, it was useful to examine its unoptimised performance 
with the PHAT data, as an indication of the extent to which op- 
timisation of the network characteristics to a given input data 
scheme matters. 



4. PHAT1 - a test on GOODS data 

4.1. Motivation 

The estimation of photo-z's is special in the sense that the de- 
sired answer can in principle be obtained through spectroscopic 
observations. Thus, we have an accurate benchmark which we 
can compare photo-z's to and we do not have to rely fully on sim- 
ulations. This is a very different situation from other estimation 
problems in astronomy, e.g. the estimation of shapes of galaxies 
for weak gravitational lensing, where accurate knowledge of the 
intrinsic shape is inaccessible for comparison. 

Given the high complexity of the photo-z approach and the 
multiple factors that influence the results it is reasonable to test 
the photo-z codes on real photometric data of objects that have 
also been observed spectroscopically for precise redshift mea- 
surements. In this way the tendency of simulations to idealise 
certain aspects of real data can be avoided. 

As a note of caution it should, however, be mentioned 
that comparisons of photo-z's to spec-z's might well draw a 
somewhat idealised picture of photo-z performance. The cur- 
rently available spectroscopic catalogues are only highly com- 
plete at bright magnitudes. For fainter magnitudes the fraction 
of high-quality spectroscopic redshift measurements decreases. 
As Hildebrandt et al. (2008) showed, the objects missing in the
spec-z catalogues are likely the ones for which photo-z
estimation is also harder and photo-z accuracy is worse. We chose the
GOODS-N field also for the reason that it is one of the regions 
of the sky with the most complete spectroscopy down to faint 
limits. 



4.2. Data set 

The imaging data for this test are part of the Great
Observatories Origins Deep Survey northern field (GOODS-N,
Giavalisco et al. 2004). The original four-band, optical ACS
data are complemented with images at other wavelengths from a
variety of instruments. See Table 4 for a summary. In total,
there are data in 18 bands covering the near-UV to the mid-IR.

The photometry used in the PHAT1 test is drawn from Capak et al.
(2004), which includes U, B_J, V_J, R_C, I_C, z' and HK'
photometry. Deep J and H band photometry taken with ULBCAM on
the UH2.2m (Wang et al. 2006) and K_s band photometry taken with
WIRC on Palomar (Bundy et al. 2005) were added by first PSF
matching then measuring photometry in 3" diameter apertures
using the method described in Capak et al. (2004). The GOODS-ACS
photometry in F435W (B), F606W (V+R), F775W (I), and F850LP (z')
along with the IRAC data (Moustakas et al., private
communication) were added by positionally matching the
catalogues provided by the GOODS team with the Capak et al.
(2004) catalogues using a 1" matching radius. Following
recommended practice, the SExtractor MAG_AUTO magnitudes were
used for the ACS data, while the aperture corrected 3.6"
diameter aperture magnitudes were used for IRAC.

For this stage of testing we wanted to use publicly available
data that could be obtained with minimal effort by an average
researcher. The results of this test illustrate the critical
role that photometric methods play in obtaining good photo-z's.
We strongly recommend care in obtaining photometry across images
with variable and very different PSFs. Images should be aligned,
the PSFs matched, fluxes measured in consistent apertures, and
care should be taken to ensure noise estimates are correct
(Capak et al. 2004, 2007; Wolf et al. 2004; Fernandez-Soto et
al. 2001). As illustrated by our test on one of the best studied
fields in the sky, correctly measured pan-chromatic photometry
is not generally available. Users will likely have to, and
probably should, measure their own photometry to ensure the best
results. This is made simpler by automated tasks such as
ColorPro (Coe et al. 2006), which measure PSF matched aperture
photometry for a combination of space and ground based data,
while more complicated routines such as TFIT (Laidler et al.
2007) fit high resolution galaxy images using the local PSF for
each image.

Bulk photometric offsets were removed by minimising the 
offset between the predicted and measured photometric points as 
a func tion of rest frame magnitude as described in Capak et al. 
( |2007) p]The resulting photometry has mean systematic offsets 
between photometric bands smaller than 0.01 mag. However, 
close inspection of the photometric catalogue shows that there 
is a fraction of objects which show a rather large discrepancy 



4 Note that this procedure is only mildly dependent on the Capak 
et al. ( 2007 ) template set used for the re-calibration because the redshift 
range of the training sample is broad. For a given template SED the 
same rest frame wavelength corresponds to many different observer's 
frame wavelengths so that systematic features in a template get dis- 
tributed evenly over many filters. Only BP-t, HY-t, and KR-t use tem- 
plate sets that are somewhat similar to the Capak et al. ( 2007 1 template 
set. 
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Table 4. Filters used for the PHAT1 test. 



[Fig. 9 legend, columns "All" and "z_spec": Frac($\Delta m > 0.3$) = 15.0%, 16.9%; Frac($\Delta m > 0.5$) = 8.8%, 10.9%.]

Fig. 9. Difference between the average ACS (mean of F606W,
F775W, and F850LP) and average SUPRIMECAM (mean of
RIz) magnitudes as a function of redshift in the PHAT1
catalogue.



between the ACS- and the SUPRIMECAM-photometry in the 
optical. Those objects are essentially evenly distributed in red- 
shift. A fraction of 15% (10%) of the objects shows a differ- 
ence of > 0.3mag (> 0.5mag) between an object's average ACS 
magnitude (mean of F606W, F775W, and F850LP) and an av- 
erage SUPRIMECAM magnitude (mean of RIz), as displayed 
in Fig. 9. Some of these objects might be variable, while others
might be affected by different blending in the space- and ground- 
based bands. We do not filter these objects because they are also 
included in photometric catalogues that are routinely used for 
many science projects. We want to provide estimates of photo-z 
accuracy that are as close to reality as possible and such mis- 
matches of photometry from different instruments (or also differ- 
ent bands of the same instrument) are not exceptions but rather 
the norm. Such issues reflect the complex problem of obtain- 
ing a good photometric catalogue from multi-band imaging data 
taken with different cameras and/or taken under different observ- 
ing conditions. But we will comment upon the impact of these 
objects on global photo-z performance in the following sections 
and mention some strategies to prune them. 
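The comparison described above can be sketched in a few lines. This is a minimal illustration, not the procedure used to build the PHAT1 catalogue: the function name, the toy magnitudes, and the use of a simple unweighted mean are assumptions made here for the example; only the band choice (F606W, F775W, F850LP versus RIz) and the 0.3 mag and 0.5 mag thresholds come from the text.

```python
from statistics import mean


def is_discrepant(acs_mags, suprimecam_mags, threshold=0.3):
    """Return True if the mean ACS and mean SUPRIMECAM magnitudes differ
    by more than `threshold` magnitudes (in absolute value)."""
    return abs(mean(acs_mags) - mean(suprimecam_mags)) > threshold


# Hypothetical toy catalogue (made-up AB magnitudes):
# each entry is ((F606W, F775W, F850LP), (R, I, z)).
catalogue = [
    ((23.10, 22.80, 22.60), (23.05, 22.85, 22.65)),  # consistent photometry
    ((24.50, 24.10, 23.90), (23.90, 23.70, 23.50)),  # moderate mismatch
    ((22.00, 21.70, 21.50), (22.60, 22.30, 22.20)),  # large mismatch
]

flags_03 = [is_discrepant(acs, sup, 0.3) for acs, sup in catalogue]
flags_05 = [is_discrepant(acs, sup, 0.5) for acs, sup in catalogue]

# Fraction of objects above each threshold, in percent.
frac_03 = 100.0 * sum(flags_03) / len(catalogue)
frac_05 = 100.0 * sum(flags_05) / len(catalogue)
print(f"Frac(>0.3 mag) = {frac_03:.1f}%, Frac(>0.5 mag) = {frac_05:.1f}%")
```

A flag of this kind can be used either to prune a catalogue or, as done here, only to report statistics with and without the flagged objects.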

The photometric catalogue is matched to different spectroscopic
catalogues from Cowie et al. (2004),^5 Wirth et al. (2004),
Treu et al. (2005), and Reddy et al. (2006).^6 This yields a total
of 1984 objects with 18-band photometry and spectroscopic
redshifts. We randomly select a quarter of those objects as a
training set, i.e. for the release of the catalogue the spectroscopic
redshifts of one quarter of the objects are revealed.^7 The magni-



5 which includes spec-z's from Cohen et al. (1996, 2000); Cohen
(2001); Phillips et al. (1997); Lowenthal et al. (1997); Dickinson
(1998); Liu et al. (1999); Barger et al. (2000, 2001, 2003); Steidel et al.
(1996, 2003)

6 which includes spec-z's from Blain et al. (2004)

7 It should be noted that this is a fairly small training set for such a 
large redshift range. It cannot be expected that empirical codes perform 



Filter    Instrument            Limiting mag. (AB)
U         MOSAIC@KPNO-4m        27.
B         SUPRIMECAM@Subaru     26.
V         SUPRIMECAM@Subaru     26.8^a
R         SUPRIMECAM@Subaru     26.6^a
I         SUPRIMECAM@Subaru     25.6^a
z         SUPRIMECAM@Subaru     25.4^a
F435W     ACS@HST               27.
F606W     ACS@HST               27.
F775W     ACS@HST               27.1^b
F850LP    ACS@HST               26.6^b
J         ULBCAM@UH-2.2m        24.
H         ULBCAM@UH-2.2m        23.
HK        QUIRC@UH-2.2m         22.1^a
K         WIRC@Hale-5m          22.
3.6 μm    IRAC@Spitzer          25.
4.5 μm    IRAC@Spitzer          25.8
5.8 μm    IRAC@Spitzer          23.0^e
8.0 μm    IRAC@Spitzer          23.0^e

a 5-σ in a circular aperture with a diameter of 3″
b 10-σ in a circular aperture with a diameter of 0.2″
c 5-σ for a point-source
d 5-σ for a Gaussian profile with FWHM = 1.3″
e 10-σ for a point-source



tude and redshift distributions are shown in Fig. 10. Note that the
catalogue is highly complete down to R ~ 24. The PHAT1 cat- 
alogue does not only contain normal galaxies. There is a small 
number of AGN in the sample which we explicitly decided to 
include. 
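The random selection of the training set can be illustrated as follows. This is only a sketch: the function name, the fixed seed, and the use of integer object identifiers are assumptions made for the example, and the actual objects released as the PHAT1 training set are not reproduced by it. Only the catalogue size (1984 objects) and the fraction (one quarter) are taken from the text.

```python
import random


def split_training(object_ids, fraction=0.25, seed=0):
    """Randomly split object identifiers into a training subset (whose
    spec-z's would be revealed) and the remaining blind test subset."""
    rng = random.Random(seed)  # fixed seed so that the split is reproducible
    n_train = round(len(object_ids) * fraction)
    train = set(rng.sample(object_ids, n_train))
    test = [obj for obj in object_ids if obj not in train]
    return sorted(train), test


# 1984 objects with 18-band photometry and spectroscopic redshifts.
ids = list(range(1984))
train_ids, test_ids = split_training(ids)
print(len(train_ids), len(test_ids))  # 496 1488
```

With 1984 objects a quarter corresponds to 496 training objects, which is the "~ 500 objects" scale on which the empirical codes have to be trained.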

The participants are asked to run their codes twice on the 
provided catalogue, once including the IRAC bands and once 
without the IRAC bands. This is done because many template 
sets are inaccurate in the mid-IR and we do not want this effect 
to dominate the comparisons. Unlike in PHAT0 the participants 
using template-based codes were asked to choose the best pos- 
sible template set for their code in PHAT1. Thus, template sets 
differ between the different "-t" methods here. 

4.3. Results for the 14-band case 

We use a similar set of statistics as for the PHAT0 test to charac- 
terise the performance of the photo-z's on the PHAT1 data with 
two differences: 

- We report the bias and scatter of $\Delta z' = \frac{z_{\rm spec} - z_{\rm phot}}{1 + z_{\rm spec}}$.

- Outliers are defined as objects with $|\Delta z'| > 0.15$.
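The two definitions above translate directly into code. The sketch below makes one assumption that is not stated in this section: bias and scatter are taken to be the mean and the standard deviation of $\Delta z'$ after the outliers have been removed. The function names and the toy redshifts are hypothetical.

```python
from statistics import mean, pstdev


def delta_z_prime(z_spec, z_phot):
    """Normalised photo-z error: (z_spec - z_phot) / (1 + z_spec)."""
    return (z_spec - z_phot) / (1.0 + z_spec)


def photoz_stats(z_spec, z_phot, outlier_cut=0.15):
    """Return (bias, scatter, outlier percentage).

    Assumption of this sketch: bias and scatter are the mean and the
    standard deviation of the non-outlier objects only.
    """
    dz = [delta_z_prime(s, p) for s, p in zip(z_spec, z_phot)]
    core = [d for d in dz if abs(d) <= outlier_cut]
    outlier_pct = 100.0 * (len(dz) - len(core)) / len(dz)
    return mean(core), pstdev(core), outlier_pct


# Hypothetical toy sample: three good estimates and one catastrophic failure.
z_spec = [0.5, 1.0, 2.0, 3.0]
z_phot = [0.47, 1.04, 2.0, 0.2]
bias, scatter, outlier_pct = photoz_stats(z_spec, z_phot)
print(f"bias={bias:.3f} scatter={scatter:.3f} outliers={outlier_pct:.1f}%")
```

Passing `outlier_cut=0.5` gives the relaxed outlier criterion used for Table 6.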

The resulting statistics are summarised in Table 5,^8 and the
scatter and outlier values are plotted in Fig. 11 for the full
sample and for an R < 24 magnitude-limited sample. The full
error distributions are displayed in Figs. 12 & 13 for the 14-band
case (i.e. without the IRAC bands). The results for the empirical
codes only include the non-training objects whereas the results
for the template-based codes include all objects. We checked
the performance of the template-based codes on the training and
non-training sample and found no significant differences.



as well on such a data set as template-based codes. This should not be 
regarded as a deficiency in the codes but rather a deficiency in the data. 

8 In Table 6 results are presented for a relaxed definition of outliers
being objects with $|\Delta z'| > 0.5$.
Fig. 10. R-band magnitude (left) and redshift distributions (right) of the PHAT1 catalogue (solid) and the training sub-sample
(dotted).
Table 5. Results for the PHAT1 catalogue with and without the
IRAC bands, and for all objects and a magnitude-limited sample
with R < 24.

        18-band                        14-band                        18-band R < 24            14-band R < 24
Code    bias    scatter  outl.^a       bias    scatter  outl.^a       bias    scatter  outl.^a  bias    scatter  outl.^a
BP-t    -0.046  0.060    30.9 (27.7)    0.011  0.048    11.4 (7.1)    -0.053  0.055    31.3      0.012  0.044     6.7
BP2-t    0.003  0.041    10.4 (7.5)     0.004  0.041    10.2 (7.8)     0.003  0.035     6.4      0.005  0.035     5.9
EA-t     0.020  0.042    11.6 (5.9)     0.022  0.042    13.5 (7.1)     0.021  0.037     7.0      0.023  0.037     8.8
GA-t    -0.009  0.061    23.1 (18.1)    0.016  0.059    19.3 (15.5)   -0.012  0.059    18.3      0.018  0.057    14.6
HY-t    -0.001  0.058    18.5 (15.2)    0.018  0.055    14.7 (10.1)   -0.002  0.055    15.7      0.019  0.054    10.9
KR-t    -0.008  0.053    19.7 (13.3)   -0.006  0.053    16.7 (9.8)    -0.010  0.049    15.4     -0.008  0.050     9.2
LP-t     0.004  0.040     7.7 (4.9)     0.009  0.038     9.2 (4.7)     0.005  0.036     3.9      0.009  0.034     4.5
LR-t     0.024  0.061    14.8 (12.9)    0.038  0.055    18.8 (15.9)    0.021  0.058     9.2      0.039  0.051    14.4
AN-e    -0.010  0.074    31.0 (29.0)   -0.006  0.078    38.5 (36.5)   -0.013  0.071    24.4     -0.007  0.076    32.8
EC-e    -0.001  0.067    18.4 (15.3)    0.002  0.066    16.7 (13.3)   -0.006  0.064    14.5     -0.003  0.064    13.5
PO-e    -0.009  0.052    18.0 (14.5)   -0.007  0.051    13.7 (9.4)    -0.009  0.047    10.7     -0.008  0.046     7.1
RT-e    -0.009  0.066    21.4 (19.0)   -0.008  0.067    24.2 (21.6)   -0.012  0.063    16.4     -0.012  0.064    18.4

a Percentage of objects with $|\Delta z'| = \left|\frac{z_{\rm spec} - z_{\rm phot}}{1 + z_{\rm spec}}\right| > 0.15$. The numbers
for the cleaned sample excluding objects with discrepant
ACS/SUPRIMECAM photometry are given in brackets.



The most striking feature in Fig. 12 and Table 5 is the large
fraction of outliers (> 9% of the total sample) with catastrophically
wrong photo-z's. This fraction is higher than typical literature
estimates. It should be emphasised that some of the objects
included here are unusual in the sense that they have SEDs
different from normal galaxies (e.g. AGNs). A small fraction is
also influenced by blending effects in the ground-based bands or
variability, so that there is a mismatch between the ACS and the
SUPRIMECAM optical photometry. There may also be a very
small number of objects with wrong spec-z's. But the bulk of
the outliers are real. If we reject objects which have discrepant
photometry between ACS and SUPRIMECAM (see Sect. 4.2),
the outlier rates decrease considerably, as indicated by the values
in brackets in Table 5. The bias is largely unaffected by this
filtering and the scatter values do not decrease by more than
10% (both not given in Table 5). We also test the most accurate
code in PHAT1 (LP-t) without ACS photometry. The statistics
of the problematic objects do not improve significantly although
excluding ACS removes the discrepancy between overlapping
optical filters. This suggests that most of the outliers amongst
these objects are not just outliers because their photometry is
corrupted, but rather because it is intrinsically harder to estimate
photo-z's for them. We leave the detailed characterisation
of these peculiar objects (their morphology, SEDs, remaining
photometric issues, etc.) to a future study.

Many codes seem to have problems correctly identifying the
redshifts of objects from the Reddy et al. (2006) sample
with $1.5 \lesssim z \lesssim 3$. We explicitly decided to include those
objects in the test in order not to artificially idealise the
situation. PHAT was conceived to give a realistic picture of what
can be achieved with today's techniques. The outliers reported
here are present in deep photometric catalogues and it is a
delicate task for every scientist to remove them or account for their
effect. The fact that literature values of outlier rates are usually
Fig. 11. Scatter and outlier values for the 14- (crosses) and 18-band (squares) PHAT1 case. The arrows indicate the effect of adding
the IRAC bands on photo-z accuracy. The left panel shows the statistics for all objects and the right panel the ones for all objects
with an R-band magnitude R < 24.
Table 6. Same as Table 5 but with a relaxed criterion for outliers.

        18-band                        14-band                        18-band R < 24            14-band R < 24
Code    bias    scatter  outl.^a       bias    scatter  outl.^a       bias    scatter  outl.^a  bias    scatter  outl.^a
BP-t    -0.054  0.122     5.9 (5.0)     0.010  0.085     4.5 (5.0)    -0.095  0.112     5.8     -0.098  0.112     5.8
BP2-t    0.009  0.084     3.8 (2.4)     0.011  0.081     3.6 (2.4)     0.008  0.072     1.5      0.008  0.072     1.5
EA-t     0.023  0.088     4.2 (2.0)     0.026  0.092     5.5 (2.0)     0.024  0.074     1.9      0.024  0.074     1.9
GA-t    -0.014  0.125     8.7 (5.9)     0.030  0.106     7.7 (5.9)    -0.026  0.115     5.4     -0.026  0.115     5.4
HY-t    -0.011  0.1??     4.? (4.2)     0.02?  0.09?     4.5 (4.2)    -0.01?  0.109     3.5     -0.016  0.109     3.5
KR-t    -0.015  0.114     8.6 (5.9)    -0.003  0.105     6.9 (5.9)    -0.024  0.101     6.6     -0.024  0.101     6.6
LP-t     0.003  0.079     2.3 (1.4)     0.011  0.079     3.7 (1.4)     0.005  0.060     1.0      0.005  0.060     1.0
LR-t     0.028  0.104     4.5 (4.0)     0.054  0.098     7.6 (4.0)     0.023  0.087     2.5      0.023  0.087     2.5
AN-e    -0.036  0.151     3.1 (2.4)    -0.035  0.173     4.2 (2.4)    -0.047  0.130     1.4     -0.047  0.130     1.4
EC-e    -0.007  0.120     3.6 (3.1)    -0.003  0.114     3.6 (3.1)    -0.015  0.106     1.9     -0.015  0.106     1.9
PO-e    -0.013  0.124     3.1 (2.3)     0.001  0.107     2.3 (2.3)    -0.020  0.098     1.2     -0.020  0.098     1.2
RT-e    -0.031  0.126     3.2 (2.8)    -0.028  0.137     3.6 (2.8)    -0.034  0.111     1.4     -0.034  0.111     1.4

a Percentage of objects with $|\Delta z'| = \left|\frac{z_{\rm spec} - z_{\rm phot}}{1 + z_{\rm spec}}\right| > 0.5$. The numbers
for the cleaned sample excluding objects with discrepant
ACS/SUPRIMECAM photometry are given in brackets.



Table 7. Same as Table 5 but in two different redshift bins.

        18-band z_spec < 1.5           14-band z_spec < 1.5           18-band z_spec > 1.5           14-band z_spec > 1.5
Code    bias    scatter  outl.^a       bias    scatter  outl.^a       bias    scatter  outl.^a       bias    scatter  outl.^a
BP-t    -0.050  0.055    31.4 (27.5)    0.013  0.044     7.2 (4.1)    -0.019  0.074    28.0 (28.9)   -0.001  0.075    35.3 (27.5)
BP2-t    0.003  0.035     6.8 (4.9)     0.005  0.035     6.5 (4.5)     0.001  0.071    30.7 (25.1)    0.001  0.075    31.0 (31.3)
EA-t     0.021  0.037     9.9 (3.9)     0.022  0.038    11.9 (4.9)     0.014  0.065    21.3 (19.9)    0.024  0.062    22.7 (22.3)
GA-t    -0.010  0.060    19.7 (14.6)    0.018  0.057    16.4 (12.9)    0.003  0.071    42.7 (42.7)    0.008  0.073    35.0 (34.1)
HY-t    -0.003  0.055    16.5 (12.9)    0.018  0.054    12.3 (8.9)     0.014  0.072    29.7 (30.8)    0.021  0.062    28.0 (18.5)
KR-t    -0.012  0.047    16.8 (11.8)   -0.011  0.050    10.5 (6.1)     0.026  0.072    35.7 (24.2)    0.042  0.062    51.3 (36.0)
LP-t     0.005  0.037     6.2 (3.2)     0.008  0.034     6.8 (2.8)     0.002  0.059    15.7 (16.6)    0.014  0.057    23.0 (18.0)
LR-t     0.023  0.059    10.1 (8.3)     0.039  0.053    15.1 (12.0)    0.028  0.079    41.3 (45.0)    0.037  0.070    39.7 (43.1)
AN-e    -0.017  0.070    27.6 (25.5)   -0.010  0.076    33.6 (31.6)    0.051  0.078    50.7 (53.2)    0.045  0.077    66.4 (70.3)
EC-e    -0.003  0.065    16.1 (12.9)   -0.000  0.064    14.5 (11.4)    0.015  0.077    32.3 (32.3)    0.015  0.077    29.5 (26.6)
PO-e    -0.012  0.049    12.6 (9.6)    -0.011  0.047     9.4 (6.0)     0.019  0.075    48.3 (48.3)    0.026  0.074    37.7 (32.7)
RT-e    -0.016  0.062    19.6 (17.0)   -0.014  0.064    21.1 (18.6)    0.040  0.072    31.8 (32.9)    0.039  0.071    41.9 (42.4)

a Percentage of objects with $|\Delta z'| = \left|\frac{z_{\rm spec} - z_{\rm phot}}{1 + z_{\rm spec}}\right| > 0.15$. The numbers
for the cleaned sample excluding objects with discrepant
ACS/SUPRIMECAM photometry are given in brackets.



smaller reflects the difficulty of a blind test, but it most proba- 
bly also reflects that our combined spec-z catalogue, explicitly 
including objects from the so-called "redshift-desert", is more 
complete and representative than some other commonly used 
catalogues. Especially at R < 24 our spec-z catalogue is highly
complete, and also for this bright cut the outlier rates are rather
large for most codes (see Table 5 and the right panel of Fig. 11).

There are means of identifying outliers (poor fits, broad 
redshift-probability functions, etc.) and photometric catalogues 
can often be cleaned (e.g. by extraction flags) to yield much 
lower outlier rates. Depending on the science application such a 
filtering can be more or less applicable. For example, we showed 
that rejecting objects with problematic photometry can improve 
the situation considerably. However, photo-z's are often used in 
a rather blind fashion without extensive checking (often due to a 
lack of spec-z comparisons) and filtering. Some science
applications also rely on redshifts for all objects, not allowing for
filtering. For those kinds of applications the raw numbers reported by
PHAT1 in Table 5 are more informative than the cleaned ones
given in brackets.
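One of the outlier indicators mentioned above, a broad or multi-peaked redshift-probability function, can be turned into a simple quality cut. The sketch below is an assumption-laden illustration and does not reproduce the quality flag of any particular photo-z code: the function names, the window of $\pm 0.1\,(1 + z_{\rm peak})$, the uniform redshift grid, and the threshold of 0.7 are all hypothetical choices.

```python
import math


def peak_confidence(z_grid, pdf, half_width=0.1):
    """Return (z_peak, fraction of the total probability that lies within
    +/- half_width * (1 + z_peak) of the PDF peak).

    Assumes a uniform redshift grid, so plain sums stand in for integrals.
    """
    total = sum(pdf)
    i_peak = max(range(len(pdf)), key=lambda i: pdf[i])
    z_peak = z_grid[i_peak]
    window = half_width * (1.0 + z_peak)
    inside = sum(p for z, p in zip(z_grid, pdf) if abs(z - z_peak) <= window)
    return z_peak, inside / total


def gaussian(z, mu, sigma):
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2)


# Hypothetical toy PDFs on a grid 0 <= z <= 4 with step 0.1.
z_grid = [0.1 * i for i in range(41)]
narrow = [gaussian(z, 1.0, 0.05) for z in z_grid]  # single sharp peak
bimodal = [gaussian(z, 0.3, 0.1) + 0.9 * gaussian(z, 2.8, 0.1) for z in z_grid]

MIN_CONFIDENCE = 0.7  # hypothetical threshold, not taken from the text
conf_narrow = peak_confidence(z_grid, narrow)[1]
conf_bimodal = peak_confidence(z_grid, bimodal)[1]
keep = [c >= MIN_CONFIDENCE for c in (conf_narrow, conf_bimodal)]
print(keep)
```

The single-peaked object passes the cut while the object with two competing redshift solutions is rejected. Whether such a rejection is acceptable depends on the science application, as discussed above.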

The best performance on this data set is achieved by the LP- 
t, BP2-t, EA-t, and BP-t codes, with LP-t showing the small- 
est scatter and outlier rates. The empirical PO-e code follows 



closely. While EA-t and BP-t also performed nicely on the 
PHAT0 test with noise (LP-t was used for the creation of the 
PHAT0 simulations), the good results for PO-e came as a sur- 
prise because this code ranked next to last in the PHAT0 test 
with noise. The sparse training set of PHAT1 (~ 500 objects) is 
apparently large enough to fully exploit the capabilities of PO- 
e because there are not too many degrees of freedom involved 
here. In contrast, the empirical AN-e code that was in the top
group for PHAT0 essentially fails on PHAT1. The training set of
PHAT1 is too sparse to train the neural network over this large
redshift range. Neural networks are generally very good at
interpolating smooth functions. However, the colour-redshift
mapping of galaxies is highly complex in many places. Furthermore,
there are ambiguities (also called colour-redshift degeneracies;
Benítez 2000) in a catalogue spanning a large redshift range, i.e.
objects with very different redshifts and very similar colours. In
general, neural networks such as the one used in AN-e are not
prepared to deal with such ambiguities since they only assign one
output redshift to a particular point in colour space.
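The effect of a colour-redshift degeneracy on a single-valued estimator can be shown with a toy example. The sketch does not model AN-e or any neural network; it only uses the generic fact that the single value minimising the mean squared error for a set of training redshifts is their mean. The colours, redshifts, and function name are made up for illustration.

```python
from statistics import mean

# Hypothetical training objects, (colour, z_spec): the same colour is shared
# by a low-redshift and a high-redshift population.
training = [(0.80, 0.30), (0.80, 0.32), (0.80, 2.80), (0.80, 2.78)]


def single_valued_estimate(colour, training, tol=0.01):
    """Best single redshift, in the least-squares sense, for a given colour:
    the mean of the training redshifts at (nearly) that colour."""
    zs = [z for c, z in training if abs(c - colour) <= tol]
    return mean(zs)


z_hat = single_valued_estimate(0.80, training)

# Normalised errors (z_spec - z_phot) / (1 + z_spec) for every training object.
errors = [(z - z_hat) / (1.0 + z) for _, z in training]
print(f"z_hat = {z_hat:.2f}")
print([round(e, 2) for e in errors])
```

The single estimate falls between the two populations, so every object in this toy sample ends up as an outlier with $|\Delta z'| > 0.15$, although each population on its own would be easy to fit.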

The top group of five (LP-t, BP2-t, EA-t, BP-t, and PO-e) is 
followed by HY-t, KR-t, LR-t, GA-t, EC-e, and RT-e in approx- 
imately this order. HY-t, KR-t and LR-t show some more or less 
pronounced, peculiar features with a number of objects being 
Fig. 13. Similar to Fig. 12 but showing $\Delta z = z_{\rm spec} - z_{\rm phot}$ vs. $z_{\rm spec}$.
assigned very similar photo-z's (horizontal features in Fig. 12).
These features certainly have a large influence on the statistics 
and prevent those codes from performing as well as the top group 
although their error distribution in the core looks very similar. 
GA-t and EC-e clearly show a larger scatter in the core of the
error distribution. The distribution for EC-e is smoother but with 
a larger width resulting in the largest scatter (excluding AN-e). 

It is obvious that the empirical codes produce biases that are 
smaller by typically a factor of two compared to the template- 
based codes. The data-model match is by construction better 
in the empirical case. A mismatch in the template-based case
can be due to both slightly inaccurate templates and slightly
inaccurate photometry. It should be noted that it is very hard
to achieve an accurate photometric cross-calibration spanning
the whole wavelength range from the UV to the mid-IR. EC-e,
which was designed with the goal of being as bias-free as possible,
indeed shows by far the smallest bias. The combination of a
machine-learning algorithm and the proper use of PDFs pays off
here.



4.4. Results for the 18-band case 



In Figs. 14 & 15 the results for the 18-band case (i.e. with IRAC
bands included) are presented. The statistics are also listed in
Table 5 and the scatter and outlier values for the different codes
are plotted in Fig. 11 in comparison to the ones of the 14-band
case.



It is immediately obvious, especially from Fig. 11, that
not all codes benefit from adding the IRAC photometry. Only
LP-t, EA-t, LR-t, RT-e, and AN-e show some improvement
when adding this information about the observed-frame mid-IR
SEDs of the objects. The outlier rates of LP-t and EA-t
decrease by ~ 15% compared to the 14-band case, making them by
far the best codes in this test, together with BP2-t, which
basically shows the same performance as with 14 bands. RT-e also
improves slightly in scatter and outlier rate with 18 bands
compared to 14 bands. The bias and outlier rate of LR-t are decreased
somewhat but with the trade-off of a slightly larger scatter. AN-e
does not perform as poorly with 18 bands as with 14 bands but
is still the least accurate code in this test.



PO-e, KR-t, HY-t, GA-t, and EC-e show slightly worse
performance than in the 14-band case with approximately
conserved order. BP-t, however, shows a huge increase in the
number of outliers by ~ 200% due to a very poor low-z
performance. Most of the Coe et al. (2006) templates are undefined
and must be extrapolated for $\lambda > 25600$ Å. These extrapolated
SEDs have significantly lower fluxes in the mid-IR compared to
the observed IRAC photometry, resulting in a large photo-z bias.
Re-calibration of the IRAC zeropoints with this template set
improves the situation somewhat, but is not done here for
simplicity. The good performance of BP2-t shows that it is not the code
but the template set that makes the difference here.
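The relative changes quoted in this section can be recomputed from the full-sample outlier percentages listed in Table 5. The snippet below is plain arithmetic on those published values; only the helper name is introduced here.

```python
# Full-sample outlier percentages from Table 5: (14-band, 18-band).
outlier_rates = {
    "LP-t": (9.2, 7.7),
    "EA-t": (13.5, 11.6),
    "BP2-t": (10.2, 10.4),
    "BP-t": (11.4, 30.9),
}


def relative_change_percent(before, after):
    """Relative change of `after` with respect to `before`, in percent."""
    return 100.0 * (after - before) / before


changes = {
    code: relative_change_percent(*rates) for code, rates in outlier_rates.items()
}
for code, change in changes.items():
    print(f"{code}: {change:+.0f}%")
```

This gives decreases of about 16% and 14% for LP-t and EA-t, an essentially unchanged rate for BP2-t, and an increase of about 170% for BP-t, i.e. its outlier rate grows by a factor of roughly 2.7 when the IRAC bands are added.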
4.5. Discussion of the PHAT1 results

The performance shown by the best codes in the semi-blind
PHAT1 test, with low bias and scatter values in the 4-5%
range, is compatible with typical literature values. Only the large
fraction of outliers (> 7.5%) is worse than expected. We
attribute this to the higher completeness of our combined spec-z
catalogue, in addition to the presence of objects with unusual SEDs
and some problems with the combination of space-based and
ground-based photometry. It should be noted that the PHAT1
spectroscopic catalogue represents a very deep sample and is not 
purely magnitude-limited. However, such depths are commonly 
used in photometric studies in extragalactic astronomy. We can- 
not fully quantify the fraction of outliers that are due to photom- 
etry problems on the one hand or due to intrinsically problematic 
objects with strange SEDs on the other hand. But the test of LP-t 
without ACS data described in Sect. 4.3 suggests that most of 
the problem seen here is connected to the latter. 

Differences in the accuracy of the codes for the 14-band case 
can mostly be attributed to differences in the template sets and 
priors for the SED-fitting codes on the one hand and differences 
in the training schemes for the empirical codes on the other hand. 
It is not the aim of this study to explain all the features seen in 
this comparison. Rather we want to provide a snapshot of what 
current codes are capable of in a semi-blind application.



It is striking that half of the codes perform worse with the
IRAC photometry included. The low-z performance in particular
suffers in this case. For the template-based codes this can be
explained by insufficient knowledge of the template SEDs in the
mid-IR.^9 If the templates do not represent reality, additional
data cannot be expected to lead to an improvement. EA-t, the
only template-based code that really benefits from the
information in the IRAC bands, differs from the other template-based
codes in that it uses a template error function (see
Brammer et al. 2008, for a detailed description). This feature
weighs the measurements in the different bands according to the
estimated accuracy of the template at the rest-frame wavelength
that corresponds to the effective wavelength of a given filter at
a particular redshift step before computing the $\chi^2$. This
hard-coded template error function assigns a low accuracy to the
mid-IR spectral region of the templates so that the IRAC bands do
not influence the $\chi^2$ at low-z. At higher redshifts, however, when
IRAC probes the rest-frame near-IR or optical where templates
are more accurate, the information is used and can improve the
photo-z's. That is reflected in the lower bias and outlier
fraction for EA-t in the 18-band case when compared to the 14-band
case. BP2-t employs a filter error based on the scatter between
the photometry of best-fit models and observed photometry in a



9 This is mostly due to insufficient modelling of dust emission fea- 
tures from PAHs. 
Fig. 15. Similar to Fig. 13 but for 18 bands (i.e. including IRAC bands).




particular filter on the spectroscopic training set. This essentially 
down-weights the IRAC bands. In general the mid-IR behaviour 
of the advanced template sets used by LP-t, BP2-t, and EA-t 
seems to be more realistic than the extrapolations employed for 
some other sets leading to better performance with 18 bands. 
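The down-weighting of poorly modelled bands described above can be sketched with a toy $\chi^2$. The functional form used here, a fractional template uncertainty added in quadrature to the photometric error, is an assumption of this example; the actual template error function of EA-t is described in Brammer et al. (2008) and may differ in detail. All fluxes, errors, and names below are hypothetical.

```python
def chi2(obs_flux, obs_err, model_flux, template_frac_err):
    """Chi-square between observed and model fluxes.

    Assumed form: the variance in each band is the photometric variance
    plus (fractional template error * model flux)^2.
    """
    total = 0.0
    for f, s, m, e in zip(obs_flux, obs_err, model_flux, template_frac_err):
        variance = s ** 2 + (e * m) ** 2
        total += (f - m) ** 2 / variance
    return total


# Hypothetical two-band example (arbitrary flux units): the optical band agrees
# with the template, while the mid-IR band is under-predicted by the template.
obs_flux = [10.0, 30.0]
obs_err = [1.0, 1.0]
model_flux = [10.0, 20.0]

chi2_plain = chi2(obs_flux, obs_err, model_flux, [0.0, 0.0])
chi2_weighted = chi2(obs_flux, obs_err, model_flux, [0.0, 0.5])
print(chi2_plain, round(chi2_weighted, 3))  # 100.0 0.99
```

With no template uncertainty the mismatched mid-IR band dominates the $\chi^2$ and can drag the fit to a wrong redshift. Assigning a 50% template uncertainty to that band makes it almost irrelevant, which is the behaviour described for the IRAC bands at low z.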

The lower bias values produced by the empirical codes sug- 
gest that there are still systematic inaccuracies in most template 
sets. With a sufficient training set such inaccuracies can be
repaired by re-calibrating the templates, e.g. with the approach
described in Budavári et al. (2000). Such a better data-model match
is demonstrated by BP2-t, which consistently shows the lowest bias
of all template-based methods, although it is still somewhat
larger than the values for EC-e.



5. Conclusions 

With PHAT we provide a snapshot of the photo-z accuracy 
achievable with today's methods in semi-blind tests. Most major
photo-z codes used in the current literature are included in the
challenge presented here.

A first test, PHAT0, on highly idealised simulations yields 
good agreement between the different codes (16 participants in 
total) and especially in comparison to the LP-t code that was 
used to create the simulations. Differences are found in the han- 
dling of the opacity of the IGM, which are most likely unimpor- 
tant for practical applications (as long as only broad photometric 
bands are used). 



The PHAT1 test based on real photometric and spectroscopic 
data from the GOODS survey represents a much more difficult 
test environment including many of the challenges encountered 
in practical applications. As expected the results from twelve 
participants show a larger fluctuation in accuracy, but a general 
convergence is seen for most codes, i.e. scatter values and outlier 
rates are within a factor of two of the best code in the test. While 
the best codes perform to expectations in terms of bias and scat- 
ter, some other codes show remaining biases due to a template 
set that does not perfectly fit the data or due to an insufficient 
training set. Half of the codes do not benefit from adding mid-IR 
photometry from the Spitzer Space Telescope. This finding strongly
suggests that there is considerable inaccuracy in some of the
template sets in the rest-frame mid-IR region of the SEDs. The 
rather large outlier rates reported in this test should be taken se- 
riously since most of these problematic objects are also present 
in purely magnitude-limited photometric samples, but not nec- 
essarily in commonly used spec-z catalogues, which are incom- 
plete at fainter magnitudes. Cleaning of the catalogues is still 
necessary for PHAT1 to reach an outlier rate below ~ 5% for 
the best code in the test. More detailed future studies (possibly 
in the framework of PHAT) are needed to identify the nature of 
this problem and quantify the contributions from multi-colour 
photometry issues on the one hand and objects with intrinsically 
unusual SEDs on the other hand. We believe that solving the 
problem of these outliers lies at the core of future photo-z im- 
provements. It is clear that improved spec-z catalogues which are 
as complete as possible will be indispensable for such an effort.
Some science applications that do not rely on complete samples
of galaxies (e.g. dark energy studies with weak gravitational
shear) can greatly benefit from efficient cleaning of galaxy
catalogues. There are ways of considerably improving photo-z
accuracy by rejecting objects with unreliable estimates. It is, how-
ever, beyond the scope of this study to present strategies on how 
to optimise catalogues for different science applications and how 
to quantify those improvements. 

Photo-z accuracy is of paramount importance for a large 
number of future science projects, ranging from galaxy evolu- 
tion to cosmology. The differences in the performance of the 
different photo-z codes presented here will have a direct impact 
on the power of photometric surveys to answer those scientific 
questions. We did not quantify the impact of photo-z accuracy 
here, but it should be noted that there is still some way to go be- 
fore photo-z's reach the accuracy required for e.g. future full-sky 
dark energy surveys. 

The test environments used in this study are publicly available at
http://www.astro.caltech.edu/twiki_phat/bin/view/Main/WebHome
and can be used to assess the performance of future methods in
comparison to the results presented here in a quantitative and
unbiased way.